Abstract

Motivated by the lossy compression of an active-vision video stream, we consider the problem of finding the
rate-distortion function of an arbitrarily varying source (AVS) composed of a finite number of subsources with known
distributions. Berger's paper 'The Source Coding Game', IEEE Trans. Inform. Theory, 1971, solves this problem
under the condition that the adversary is allowed only strictly causal access to the subsource realizations. We consider
the case when the adversary has access to the subsource realizations non-causally. Using the type-covering lemma,
this new rate-distortion function is determined to be the maximum of the IID rate-distortion function over a set of
source distributions attainable by the adversary. We then extend the results to allow for partial or noisy observations
of subsource realizations. We further explore the model by attempting to find the rate-distortion function when the
adversary is actually helpful.

Finally, a bound is developed on the uniform continuity of the IID rate-distortion function for finite-alphabet
sources. The bound is used to give a sufficient number of distributions that need to be sampled to compute the
rate-distortion function of an AVS to within a certain accuracy. The bound is also used to give a rate of convergence
for the estimate of the rate-distortion function for an unknown IID finite-alphabet source.

Rate-distortion, arbitrarily varying source, uniform continuity of rate-distortion function, switcher, lossy
compression, source coding game, estimation of rate-distortion function

I. Introduction 

A. Motivation

Active vision/sensing/perception [2] is an approach to computer vision, the main principle of which is that sensors
should choose to explore their environment actively based on what they currently sense or have previously sensed.
As Bajcsy states it in [2], "We do not just see, we look." The contrast to passive sensors can be seen by comparing a
fixed security camera (non-active) to a person holding a camera (active). Even if the person is otherwise stationary, 
they may zoom the camera into any part of their visual field to obtain a better view (e.g. if they see a trespasser). 
There is also the possibility that the sensor has noncausal information about the environment. For example, a 
cameraman at a sporting event generally has only causal knowledge of the environment. A cameraman on a movie 
set, however, has noncausal information about the environment through the script. The noncausal information can 
be advantageous to the cameraman in (actively) capturing the important features of a scene. 



There is a subtle distinction between causal and strictly causal information, and this distinction is related to the
time-scales on which the environment changes. A causal active sensor knows both the present and the past, but
a strictly causal one knows only the past. If the environment changes at a pace much slower than the sensor can
actively look, there is essentially no difference between knowing the immediate past and knowing the present.
However, if the environment changes at a pace faster than the sensor can actively look (and process information),
there is intuitively a substantial difference between knowing only the past and knowing the present.

As motivation for this paper, we are interested in the fixed-rate lossy compression of an active-vision source. In 
reality, there are many interesting questions that need to be answered to truly understand the problem, including: 

• What is the relevant distortion measure for active-video? 

• Is there a distinction between the compression of an active-video source for use by the closed-loop control 
system that points the camera as compared to compression for later off-line use?

• How to model the entire plenoptic function that the active-video source will be dynamically sampling? [3] 

This work was supported by an NSF Graduate Research Fellowship. An earlier version [1] of the material in this paper was presented in 
part at the 2007 International Symposium on Information Theory, Nice, France. 
It is also clear that the core issues here extend well beyond vision. They also arise in a series of sensor
measurements that were dynamically sampled by a distributed sensor network as well as the case of measurements
taken by an autonomously moving sensor that chooses where to go in part based on what it is observing. More
provocatively, similar issues of active-sources arise when the successive source symbols are brought by customers,
each of which has free will and can choose among competing codecs for compression.^1

We concentrate entirely on the simplest aspect of the problem: what is the impact on the rate-distortion function 
of having the source being actively sampled by an entity that knows something about the realizations of the 
environment as it does the sampling. Thus, we assume an overly simplified traditional rate-distortion setting with 
known finite alphabets and bounded distortion measures. The goal is the traditional block-coding one: meet an 
average distortion constraint with high probability using as little rate as possible. 

The modeling question is whether or not it is worth building a detailed model for how the active-source is going 
to be doing its dynamic sampling of the source. Three basic ways to model the goals of the camera are worst case 
(adversarial), random (agnostic), and helpful (joint optimization of camera and coding system). Admittedly, the 
most interesting problems involve the compression of sources with memory, but following tradition we focus on 
memoryless sources to understand the basic differences between active and non-active sources for lossy compression. 

In the context of active-vision, a strictly causal adversary pointing a camera is intuitively no more threatening than 
a robot randomly pointing the camera when the scene being captured is memoryless. This intuition was formally 
proved correct in [5] by Berger as he determined the rate-distortion function for memoryless sources and a strictly 
causal adversarial model. This paper determines the rate-distortion function for the additional cases of causal and 
non-causal adversaries. The model is then extended to allow only noisy observations by the adversary doing the 
sampling of the scene. To see the impact of the details of the dynamic sampling on the rate-distortion function, the 
paper also considers how the rate-distortion function changes when the 'adversary' is actually a helpful party. 

B. Causality in information theory 

The issue of causality arises naturally in several major problems of information theory where noncausal knowledge 
of the realizations of randomness in the problem can be advantageous. Shannon [6] studied the problem of 
transmitting information over a noisy channel with memoryless state parameter revealed to the encoder causally. 
Gelfand and Pinsker [7] studied the same problem with the state parameter available to the encoder noncausally. In 
general, the capacity is larger when the channel state is available noncausally to the encoder. When the channel state 
corresponds to Gaussian interference known noncausally, Costa [8] showed that the capacity is the same as when 
the interference is not present at all. Willems ([9], [10]) gave achievable strategies when the Gaussian interference is 
known only causally. Lattice strategies for both causal and non-causal knowledge of the interference are discussed 
in [11], but the advantage of finitely anticipatory knowledge of interference is not yet explicitly understood even 
in the case of Gaussian interference. 

Agarwal et al. [12] find the capacity for an arbitrarily varying channel whose input is constrained to look like an
IID source with known distribution. The adversary is constrained to distort over a block to at most some (additive) 
distortion, but is not constrained to act causally. [12] shows that the rate-distortion function turns out to be the 
capacity for this channel. Because the codewords are constrained to look IID, simulating the action of a causal 
memoryless channel turns out to be sufficient for the adversary to minimize the capacity. 

Causality also has implications for the problem of lossy source coding, as studied by Neuhoff and Gilbert [13]. 
There, for an IID source, causal source codes generally require a higher rate to achieve distortion D than
noncausal source codes. It is also shown that optimal causal source codes can be constructed by time-sharing between
memoryless codes. Hence, there is a rate penalty for using causal coders (as opposed to noncausal coders), but
no further penalty for using memoryless coders. Similar results have been derived by Weissman and Merhav [14]
for lossy source coding with causal and noncausal side information. In [13], the channel was implicitly assumed
to be noiseless and binary. Tatikonda et al. [15] show that even if the channel is matched properly to achieve the
sequential rate-distortion function, there is a penalty for using causal coders when the sources have memory. For 
example, they show that proper matching for a Gauss-Markov source is a Gaussian channel with feedback, but the 
rate-distortion performance with this causal matching still does not meet the performance of noncausal coders. 

^1 This is related to a particularly odd kind of moral hazard in private health insurance markets. Somewhat counterintuitively, private health
insurers actually have a disincentive to provide good treatment of chronic conditions since they fear attracting patients that are intrinsically 
likely to get sick! [4] 
C. Results and organization of paper

Section II sets up the notation and model, and briefly reviews the literature on lossy compression of arbitrarily
varying sources. Section III gives the rate-distortion function for an AVS when the adversary has noncausal access
to realizations of a finite collection of memoryless subsources and can sample among them. As shown in Theorem
3.1, the rate-distortion function for this problem is the maximum of the IID rate-distortion function over the
memoryless distributions the adversary can simulate. The adversary requires only causal information to impose this 
rate-distortion function. This establishes that when the subsources are memoryless, the rate-distortion function can 
strictly increase when the adversary has knowledge of the present subsource realizations, but no further increase 
occurs when the adversary is allowed knowledge of the future. 

We then extend the AVS model to include noisy or partial observations of the subsource realizations and determine 
the rate-distortion function for this setting in Section IV. As shown in Theorem 4.1, the form of the solution is the
same as for the adversary with clean observations, with the set of attainable distributions essentially being related 
to the original distributions through Bayes' rule. 

Next, Section V explores the problem when the goal of the active sensor is to help the coding system achieve a
low distortion. Theorem 5.1 gives a characterization of the rate-distortion function if the helper is fully noncausal
in terms of the rate-distortion function for an associated lossy compression problem. As a corollary, we also give 
bounds for the cases of causal observations and noisy observations. 

Simple examples illustrating these results are given in Section VI. In Section VII, we discuss how to compute
the rate-distortion function for arbitrarily varying sources to within a given accuracy using the uniform continuity
of the IID rate-distortion function. The main tool there is an explicit bound on the uniform continuity of the IID
rate-distortion function that is of potentially independent interest. Finally, we conclude in Section VIII.

All the problems in this paper are studied in the context of fixed-length block coding. Variable-length coding 
could perform better in a universal sense by using only as much rate as required when the active sensor is not 
adversarial. However, we are interested in determining upper and lower bounds for the rate that active sensors might 
end up needing and for this purpose, fixed-length block coding is appropriate. 



II. Problem Setup 

A. Notation 

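Let $\mathcal{X}$ and $\hat{\mathcal{X}}$ be the finite source and reconstruction alphabets respectively. Let $x^n = (x_1, \ldots, x_n)$ denote an
arbitrary vector from $\mathcal{X}^n$ and $\hat{x}^n = (\hat{x}_1, \ldots, \hat{x}_n)$ an arbitrary vector from $\hat{\mathcal{X}}^n$. When needed, $x^k = (x_1, \ldots, x_k)$
will be used to denote the first $k$ symbols in the vector $x^n$.

Let $d : \mathcal{X} \times \hat{\mathcal{X}} \to [0, d^*]$ be a distortion measure on the product set $\mathcal{X} \times \hat{\mathcal{X}}$ with maximum distortion $d^* < \infty$.
Let

$$\tilde{d} = \min_{(x,\hat{x}) :\, d(x,\hat{x}) > 0} d(x, \hat{x}) \tag{1}$$

be the minimum nonzero distortion. Define $d_n : \mathcal{X}^n \times \hat{\mathcal{X}}^n \to [0, d^*]$ for $n \geq 1$ to be

$$d_n(x^n, \hat{x}^n) = \frac{1}{n} \sum_{k=1}^{n} d(x_k, \hat{x}_k). \tag{2}$$

Let $\mathcal{P}(\mathcal{X})$ be the set of probability distributions on $\mathcal{X}$, let $\mathcal{P}_n(\mathcal{X})$ be the set of types of length $n$ strings from
$\mathcal{X}$, and let $\mathcal{W}$ be the set of probability transition matrices from $\mathcal{X}$ to $\hat{\mathcal{X}}$. Let $p_{x^n} \in \mathcal{P}_n(\mathcal{X})$ be the empirical type
of a vector $x^n$. For a $p \in \mathcal{P}(\mathcal{X})$, let

$$D_{\min}(p) = \sum_{x \in \mathcal{X}} p(x) \min_{\hat{x} \in \hat{\mathcal{X}}} d(x, \hat{x}) \tag{3}$$

be the minimum average distortion achievable for the source distribution $p$. The rate-distortion function of $p \in \mathcal{P}(\mathcal{X})$
at distortion $D \geq D_{\min}(p)$ with respect to distortion measure $d$ is defined to be

$$R(p, D) = \min_{W \in \mathcal{W}(p,D)} I(p, W), \tag{4}$$

where

$$\mathcal{W}(p, D) = \left\{ W \in \mathcal{W} : \sum_{x \in \mathcal{X}} \sum_{\hat{x} \in \hat{\mathcal{X}}} p(x) W(\hat{x}|x) d(x, \hat{x}) \leq D \right\} \tag{5}$$

and $I(p, W)$ is the mutual information^2

$$I(p, W) = \sum_{x \in \mathcal{X}} \sum_{\hat{x} \in \hat{\mathcal{X}}} p(x) W(\hat{x}|x) \ln \frac{W(\hat{x}|x)}{\sum_{x' \in \mathcal{X}} p(x') W(\hat{x}|x')}. \tag{6}$$

Fig. 1. A class of models for an AVS. The switcher can set the switch position according to the rules of the model.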
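The quantities in (2), (3) and (6) are straightforward to evaluate numerically. The following is a minimal Python sketch, assuming alphabets are indexed by integers `0, 1, ...` and the distortion measure is stored as a nested list `d[x][x_hat]`; the function names are illustrative and do not come from the paper.

```python
import math
from collections import Counter

def block_distortion(x, x_hat, d):
    """d_n(x^n, x_hat^n): per-letter average of d over the block, as in (2)."""
    assert len(x) == len(x_hat)
    return sum(d[a][b] for a, b in zip(x, x_hat)) / len(x)

def empirical_type(x, alphabet_size):
    """Empirical type of x^n: relative frequency of each source letter."""
    counts = Counter(x)
    return [counts.get(a, 0) / len(x) for a in range(alphabet_size)]

def d_min(p, d):
    """D_min(p) = sum_x p(x) * min_xhat d(x, xhat), as in (3)."""
    return sum(px * min(row) for px, row in zip(p, d))

def mutual_information(p, W):
    """I(p, W) in nats, as in (6); W[x][j] is the probability of output j given x."""
    n_hat = len(W[0])
    # Output marginal induced by p and W.
    q = [sum(p[x] * W[x][j] for x in range(len(p))) for j in range(n_hat)]
    total = 0.0
    for x, px in enumerate(p):
        for j in range(n_hat):
            if px > 0 and W[x][j] > 0:
                total += px * W[x][j] * math.log(W[x][j] / q[j])
    return total

# Hamming distortion on a binary alphabet, used here only as an example.
hamming = [[0, 1], [1, 0]]
```

For instance, a noiseless channel on a uniform binary source gives `mutual_information([0.5, 0.5], [[1, 0], [0, 1]])` equal to ln 2 nats, while a channel whose output is independent of its input gives zero.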



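Let $\mathcal{B} = \{\hat{x}^n(1), \ldots, \hat{x}^n(K)\}$ be a codebook with $K$ length-$n$ vectors from $\hat{\mathcal{X}}^n$. Define

$$d_n(x^n; \mathcal{B}) = \min_{\hat{x}^n \in \mathcal{B}} d_n(x^n, \hat{x}^n). \tag{7}$$

If $\mathcal{B}$ is used to represent an IID source with distribution $p$, then the average distortion of $\mathcal{B}$ is defined to be

$$d(\mathcal{B}) = \sum_{x^n \in \mathcal{X}^n} P(x^n) d_n(x^n; \mathcal{B}) = E[d_n(x^n; \mathcal{B})], \tag{8}$$

where

$$P(x^n) = \prod_{k=1}^{n} p(x_k). \tag{9}$$

For $n \geq 1$, $D \geq D_{\min}(p)$, let $K(n, D)$ be the minimum number of codewords needed in a codebook $\mathcal{B} \subset \hat{\mathcal{X}}^n$ so
that $d(\mathcal{B}) \leq D$. By convention, if no such codebook exists, $K(n, D) = \infty$. Let the rate-distortion function^3 of an
IID source be $R(D) = \limsup_{n \to \infty} \frac{1}{n} \ln K(n, D)$. Shannon's rate-distortion theorem ([16], [17]) states that for all $n$,
$\frac{1}{n} \ln K(n, D) \geq R(p, D)$ and

$$\liminf_{n \to \infty} \frac{1}{n} \ln K(n, D) = R(D) = R(p, D). \tag{10}$$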
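For very small block lengths, the average distortion of a codebook in (8) can be evaluated by brute-force enumeration of all source words. The sketch below is only an illustration under that assumption (the cost grows as $|\mathcal{X}|^n$); the function names and the integer encoding of the alphabets are hypothetical.

```python
import itertools

def codebook_distortion(x, codebook, d):
    """d_n(x^n; B): distortion to the closest codeword, as in (7)."""
    n = len(x)
    return min(sum(d[a][b] for a, b in zip(x, c)) / n for c in codebook)

def average_distortion(p, codebook, d):
    """d(B) for an IID source p, as in (8) and (9), by enumerating all x^n."""
    n = len(codebook[0])
    total = 0.0
    for x in itertools.product(range(len(p)), repeat=n):
        prob = 1.0
        for a in x:
            prob *= p[a]  # product distribution P(x^n)
        total += prob * codebook_distortion(x, codebook, d)
    return total

hamming = [[0, 1], [1, 0]]
# A Bernoulli(1/4) source covered by the single all-zeros codeword of length 3.
single = average_distortion([0.75, 0.25], [(0, 0, 0)], hamming)
# Adding the all-ones codeword can only lower the average distortion.
pair = average_distortion([0.75, 0.25], [(0, 0, 0), (1, 1, 1)], hamming)
```

With one all-zeros codeword the average distortion equals the probability of a one; a second codeword reduces it, which is the trade-off that $K(n, D)$ quantifies.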



B. Arbitrarily varying sources 

The source coding game is a two-player game introduced in [5] by Berger as a model for an AVS. The two 
players are called the 'switcher' and 'coder'. In a coding context, the coder corresponds to the designer of a lossy 
source code and the switcher corresponds to a potentially malicious adversary pointing the camera. 

Figure 1 shows a model of an AVS. There are $m$ IID 'subsources' with common alphabet $\mathcal{X}$. In [5], the subsources
are assumed to be independent, but that restriction turns out not to be required^4. There can be multiple subsources
governed by the same distribution. In that sense, the switcher has access to a list of $m$ subsources, rather than a set
of $m$ different distributions. The marginal distributions of the $m$ subsources are known to be $\{p_l\}_{l=1}^{m}$ and we let
$\mathcal{G} = \{p_1, \ldots, p_m\}$. Let $P(x_{1,1}, \ldots, x_{m,1})$ be the joint probability distribution for the IID source $\{(x_{1,k}, \ldots, x_{m,k})\}_k$.
Fix an $n \geq 1$ and consider a block of length $n$. We let $x_{l,k}$ denote the output of the $l$-th subsource at time $k$. We will
use $x_l^n$ to denote the vector $(x_{l,1}, \ldots, x_{l,n})$. At each time $k$, the AVS outputs a letter $x_k$ which is determined by the
position of the switch inside the AVS. The switch positions are denoted $s^n = (s_1, \ldots, s_n)$ with $s_k \in \{1, 2, \ldots, m\}$
for each $1 \leq k \leq n$. With this notation, $x_k = x_{s_k,k}$ for $1 \leq k \leq n$.

^2 We use natural log, denoted ln, and nats in most of the paper. In examples only, we use bits.

^3 We define $R(D_{\min}(p)) = \lim_{D \downarrow D_{\min}(p)} R(D)$. This is equivalent to saying that a sequence of codes represents a source to within distortion
$D$ if their average distortion is tending to $D$ in the limit. The only distortion where this distinction is meaningful is $D_{\min}(p)$.

^4 In [5], the motivation was multiplexing data streams and independence is a reasonable assumption, but the proof does not require it.
Active sources, however, would likely choose among correlated subsources in practice.

The switcher can set the switch position according to the model for the AVS. For example, in the compound
source setting of Sakrison [18], the switcher chooses $s \in \{1, \ldots, m\}$ and sets $s_k = s$ for $1 \leq k \leq n$. The main
case analyzed in [5] allowed the switcher to change $s_k$ arbitrarily, but the switcher only had knowledge at time $k$
of $s^{k-1}$ and $x^{k-1}$. That is, the switcher only had knowledge of past switch positions and past AVS outputs before
deciding the switch position at each time. One of the cases analyzed in this paper is termed full-lookahead, where
the switcher makes a (possibly random) decision about the full $s^n$ with knowledge of $x_1^n, x_2^n, \ldots, x_m^n$ beforehand.
The other case is termed 1-step lookahead^5, where for each $k$, $s_k$ is a (possibly random) function of $x_1^k, \ldots, x_m^k$.
The switcher may or may not have knowledge of the codebook, but this knowledge turns out to be inconsequential 
for the rate-distortion function. 
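The difference between the strictly causal and 1-step lookahead models can be seen in a short simulation. The sketch below assumes two independent Bernoulli subsources (independence and the parameters 1/4 and 1/3 are choices made only for this illustration) and two hypothetical switcher rules: one that ignores all observations, and one that looks at the current realizations and outputs a 1 whenever one is available. Full lookahead is not simulated here.

```python
import random

def draw_subsources(probs_of_one, n, rng):
    """Draw n time steps of independent Bernoulli subsource outputs."""
    return [tuple(1 if rng.random() < q else 0 for q in probs_of_one)
            for _ in range(n)]

def run_avs(realizations, rule, lookahead):
    """Run the switch. The rule sees past switch positions and past AVS outputs,
    plus the current subsource symbols only if lookahead is True."""
    switches, outputs = [], []
    for symbols in realizations:
        visible = symbols if lookahead else None
        s = rule(switches, outputs, visible)
        switches.append(s)
        outputs.append(symbols[s])
    return switches, outputs

def always_second(switches, outputs, visible):
    """Strictly causal rule that ignores its observations (0-based index 1)."""
    return 1

def prefer_one(switches, outputs, visible):
    """1-step lookahead rule: point at a subsource currently showing a 1."""
    return visible.index(1) if 1 in visible else 0

rng = random.Random(0)
realizations = draw_subsources([1 / 4, 1 / 3], 100_000, rng)
_, causal_out = run_avs(realizations, always_second, lookahead=False)
_, cheat_out = run_avs(realizations, prefer_one, lookahead=True)
freq_causal = sum(causal_out) / len(causal_out)  # close to 1/3
freq_cheat = sum(cheat_out) / len(cheat_out)     # close to 1 - (3/4)(2/3) = 1/2
```

Under these assumptions the lookahead rule produces ones about half of the time, a frequency that no mixture of the two subsource distributions can reach.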

The coder's goal is to design a codebook $\mathcal{B}$ of minimal size to represent $x^n$ to within distortion $D$ on average.
The codebook must be able to do this for every allowable strategy for the switcher according to the model. Define

$$M(n, D) = \min \left\{ |\mathcal{B}| : \mathcal{B} \subset \hat{\mathcal{X}}^n,\ E[d_n(x^n; \mathcal{B})] \leq D \text{ for all allowable switcher strategies} \right\}. \tag{11}$$

Here, $E[d_n(x^n; \mathcal{B})]$ is defined to be $\sum_{x^n} \left( \sum_{s^n} P(s^n, x^n) \right) d_n(x^n; \mathcal{B})$, where $P(s^n, x^n)$ is an appropriate
probability mass function on $\{1, \ldots, m\}^n \times \mathcal{X}^n$ that agrees with the model of the AVS. When the switcher has full
lookahead, $P(s^n, x^n)$ must be composed of conditional distributions of the form

$$P(s^n, x^n | x_1^n, \ldots, x_m^n) = P(s^n | x_1^n, \ldots, x_m^n) \cdot \prod_{k=1}^{n} 1(x_k = x_{s_k,k}). \tag{12}$$

Then, $P(s^n, x^n)$ is simply obtained by averaging over $(x_1^n, \ldots, x_m^n)$:

$$P(s^n, x^n) = \sum_{(x_1^n, \ldots, x_m^n)} \left( \prod_{k=1}^{n} P(x_{1,k}, \ldots, x_{m,k}) \right) P(s^n, x^n | x_1^n, \ldots, x_m^n). \tag{13}$$

For a set of distributions $\mathcal{Q} \subset \mathcal{P}(\mathcal{X})$, let $D_{\min}(\mathcal{Q}) = \sup_{p \in \mathcal{Q}} D_{\min}(p)$. We are interested in the exponential rate of
growth of $M(n, D)$ with $n$. Define the rate-distortion function of an AVS to be

$$R(D) = \limsup_{n \to \infty} \frac{1}{n} \ln M(n, D). \tag{14}$$

In every case considered, it will also be clear that $R(D) = \liminf_{n \to \infty} \frac{1}{n} \ln M(n, D)$.



C. Literature Review 

a) One IID source: Suppose $m = 1$. Then there is only one IID subsource $p_1 = p$ and the switch position is
determined to be $s_k = 1$ for all time. This is exactly the classical rate-distortion problem considered by Shannon
[16], and he showed

$$R(D) = R(p, D). \tag{15}$$

Computing R{p,D) can be done with the Blahut-Arimoto algorithm [19], and also falls under the umbrella of 
convex programming. 
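As a rough illustration of the Blahut-Arimoto iteration, the sketch below computes one point on the $R(p, D)$ curve for a fixed slope parameter. It is a textbook-style version written under the assumptions of this section (finite alphabets, distortion stored as `d[x][x_hat]`), not code from [19]; the name `blahut_arimoto`, the parameter `beta`, and the fixed iteration count are illustrative choices.

```python
import math

def blahut_arimoto(p, d, beta, iters=500):
    """Return one (D, R) point of the rate-distortion curve of source p, R in nats.

    beta > 0 is the Lagrange multiplier on distortion; larger beta yields a
    smaller distortion D and a larger rate R.
    """
    nx, nxh = len(p), len(d[0])
    q = [1.0 / nxh] * nxh  # initial reconstruction marginal
    for _ in range(iters):
        # Test channel W(xhat | x) proportional to q(xhat) * exp(-beta * d(x, xhat)).
        W = []
        for x in range(nx):
            row = [q[j] * math.exp(-beta * d[x][j]) for j in range(nxh)]
            z = sum(row)
            W.append([v / z for v in row])
        # Reconstruction marginal induced by p and W.
        q = [sum(p[x] * W[x][j] for x in range(nx)) for j in range(nxh)]
    D = sum(p[x] * W[x][j] * d[x][j] for x in range(nx) for j in range(nxh))
    R = sum(p[x] * W[x][j] * math.log(W[x][j] / q[j])
            for x in range(nx) for j in range(nxh)
            if p[x] > 0 and W[x][j] > 0)
    return D, R

hamming = [[0, 1], [1, 0]]
D_point, R_point = blahut_arimoto([0.75, 0.25], hamming, beta=2.0)
```

To evaluate $R(p, D)$ at a prescribed $D$, one would sweep or bisect over `beta` until the returned distortion matches; that outer search is omitted here.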

^5 We use the term 1-step lookahead even though this term is meant to represent the causal (but not strictly) switcher. In most of the
information theory literature, 'causal' knowledge includes knowledge of the present. 
b) Compound source: Now suppose that $m > 1$, but the switcher is constrained to choose $s_k = s \in \{1, \ldots, m\}$
for all $k$. That is, the switch position is set once and remains constant afterwards. Sakrison [18] studied the
rate-distortion function for this class of compound sources and showed that planning for the worst case subsource is
both necessary and sufficient. Hence, for compound sources,

$$R(D) = \max_{p \in \mathcal{G}} R(p, D). \tag{16}$$

This result holds whether the switch position is chosen with or without knowledge of the realizations of the m 
subsources. Here, R{D) can be computed easily since m is finite and each individual R{p, D) can be computed. 

c) Causal adversarial source: In Berger's setup [5], the switcher is allowed to choose $s_k \in \{1, \ldots, m\}$
arbitrarily at any time $k$, but must do so in a strictly causal manner without access to the current time step's
subsource realizations. More specifically, the switch position $s_k$ is chosen as a (possibly random) function of
$(s_1, \ldots, s_{k-1})$ and $(x_1, \ldots, x_{k-1})$. The conclusion of [5] is that under these rules,

$$R(D) = \max_{p \in \mathrm{conv}(\mathcal{G})} R(p, D), \tag{17}$$

where $\mathrm{conv}(\mathcal{G})$ is the convex hull of $\mathcal{G}$. It should be noted that this same rate-distortion function applies in the
following cases:

• The switcher chooses $s_k$ at each time $k$ without any observations at all.

• The switcher chooses $s_k$ as a function of the first $k - 1$ outputs of all $m$ subsources.

Note that in (17), evaluating $R(D)$ involves a maximization over an infinite set, so the computation of $R(D)$ is
not trivial since $R(p, D)$ is not necessarily a concave $\cap$ function. A simple, provable, approximate (to any given
accuracy) solution is discussed in Section VII.
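For binary sources under Hamming distortion the two maximizations in (16) and (17) can be compared directly, because $R(p, D)$ has the standard closed form $h_b(p) - h_b(D)$ and $\mathrm{conv}(\mathcal{G})$ is an interval of values of $P(1)$. The sketch below uses a plain grid search over that interval; it is an illustration only, not the provably accurate procedure the paper develops, and the function names and grid size are arbitrary choices.

```python
import math

def hb(q):
    """Binary entropy in bits."""
    if q <= 0.0 or q >= 1.0:
        return 0.0
    return -q * math.log2(q) - (1 - q) * math.log2(1 - q)

def binary_rd(q, D):
    """R(p, D) in bits for a binary source with P(1) = q, Hamming distortion."""
    q = min(q, 1.0 - q)
    return hb(q) - hb(D) if D < q else 0.0

def compound_rd(qs, D):
    """Maximum over the finite set G of subsource distributions, as in (16)."""
    return max(binary_rd(q, D) for q in qs)

def strictly_causal_rd(qs, D, grid=1001):
    """Grid approximation of the maximum over conv(G), as in (17).

    For binary subsources, conv(G) is the interval [min(qs), max(qs)] of P(1).
    """
    lo, hi = min(qs), max(qs)
    return max(binary_rd(lo + (hi - lo) * i / (grid - 1), D) for i in range(grid))
```

With subsources Bernoulli(1/4) and Bernoulli(3/4), the compound value at $D = 0.1$ is $h_b(1/4) - h_b(0.1)$, while the convex hull contains the uniform distribution and the strictly causal value rises to $1 - h_b(0.1)$. With Bernoulli(1/4) and Bernoulli(1/3) the two values coincide, since the maximizer is an endpoint of the interval.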

III. R(D) for the Cheating Switcher

In the conclusion of [5], Berger poses the question of what happens to the rate-distortion function when the rules 
are tilted in favor of the switcher. Suppose that the switcher were given access to the m subsource realizations 
before having to choose the switch positions; we call such a switcher a 'cheating switcher'. In this paper, we deal 
with two levels of noncausality and show they are essentially the same when the subsources are IID over time: 

• The switcher chooses $s_k$ based on the realizations of the $m$ subsources at time $k$. We refer to this case as
1-step lookahead for the switcher.

• The switcher chooses $(s_1, \ldots, s_n)$ based on the entire length-$n$ realizations of the $m$ subsources. We refer to
this case as full lookahead for the switcher.

Theorem 3.1: Suppose the switcher has 1-step lookahead or full lookahead. In both cases, for $D \geq D_{\min}(\mathcal{C})$,

$$R(D) = \tilde{R}(D) \triangleq \max_{p \in \mathcal{C}} R(p, D), \tag{18}$$

where

$$\mathcal{C} = \left\{ p \in \mathcal{P}(\mathcal{X}) : \sum_{x \in \mathcal{V}} p(x) \geq P(x_{l,1} \in \mathcal{V},\ 1 \leq l \leq m) \ \ \forall\, \mathcal{V} \text{ such that } \mathcal{V} \subseteq \mathcal{X} \right\}. \tag{19}$$

For $D < D_{\min}(\mathcal{C})$, $R(D) = \infty$ by convention because the switcher can simulate a distribution for which the
distortion $D$ is infeasible for the coder.
Remarks: 

• If there are at least two non-deterministic subsources and $\mathrm{conv}(\mathcal{G}) \neq \mathcal{P}(\mathcal{X})$, then $\mathrm{conv}(\mathcal{G})$ is a strict subset
of $\mathcal{C}$, and thus $R(D)$ can strictly increase when the switcher is allowed to look at the present subsource
realizations before choosing the switch position. Hence, extra rate must be provisioned for active sensors in
general.

• As a consequence of the theorem, we see that when the subsources within an AVS are IID, knowledge of
past subsource realizations is useless to the switcher, knowledge of the current step's subsource realizations is
useful, and knowledge of future subsource realizations beyond the current step is useless if 1-step lookahead
is already given.

• Note that computing $R(D)$ requires further discussion, given in Section VII, just as it does for the strictly
causal case of Berger.

Proof: We give a short outline of the proof here. See Appendix I for the complete proof. To show $R(D) \leq
\tilde{R}(D)$, we use the type-covering lemma from [5]. It says for a fixed type $p$ in $\mathcal{P}_n(\mathcal{X})$ and $\epsilon > 0$, all sequences
with type $p$ can be covered within distortion $D$ with at most $\exp(n(R(p, D) + \epsilon))$ codewords for large enough $n$.
Since there are at most $(n + 1)^{|\mathcal{X}|}$ distinct types, we can cover all $n$-length strings with types in $\mathcal{C}$ with at most
$\exp(n(\tilde{R}(D) + \frac{|\mathcal{X}|}{n} \ln(n + 1) + \epsilon))$ codewords. Furthermore, we can show that types not in $\mathcal{C}$ occur exponentially
rarely even if the switcher has full lookahead, meaning that their contribution to the average distortion can be
bounded by $d^*$ times an exponentially decaying term in $n$. Hence, the rate needed regardless of the switcher
strategy is at most $\tilde{R}(D) + \epsilon$ with $\epsilon > 0$ arbitrarily small.

Now, to show $R(D) \geq \tilde{R}(D)$, we describe one potential strategy for the adversary. This strategy requires only
1-step lookahead and it forces the coder to use rate at least $\tilde{R}(D)$. For each set $\mathcal{V} \subseteq \mathcal{X}$ with $\mathcal{V} \neq \emptyset$ and $|\mathcal{V}| \leq m$,
the adversary has a random rule $f(\cdot|\mathcal{V})$, which is a probability mass function (PMF) on $\mathcal{V}$. At each time $k$, if
the switcher observes a candidate set $\{x_{1,k}, \ldots, x_{m,k}\}$, the switcher chooses to output $x \in \{x_{1,k}, \ldots, x_{m,k}\}$ with
probability $f(x|\{x_{1,k}, \ldots, x_{m,k}\})$. If $\beta(\mathcal{V}) = P(\{x_{1,k}, \ldots, x_{m,k}\} = \mathcal{V})$, let

$$\mathcal{D} = \left\{ p \in \mathcal{P}(\mathcal{X}) : p(x) = \sum_{\mathcal{V} \subseteq \mathcal{X},\, |\mathcal{V}| \leq m} \beta(\mathcal{V}) f(x|\mathcal{V}) \ \ \forall x \in \mathcal{X},\ \ f(\cdot|\mathcal{V}) \text{ is a PMF on } \mathcal{V} \ \ \forall\, \mathcal{V} \text{ s.t. } \mathcal{V} \subseteq \mathcal{X},\ |\mathcal{V}| \leq m \right\}. \tag{20}$$

$\mathcal{D}$ is the set of IID distributions the AVS can 'simulate' using these memoryless rules requiring 1-step lookahead.
It is clear by construction that $\mathcal{D} \subseteq \mathcal{C}$. Also, it is clear that both $\mathcal{C}$ and $\mathcal{D}$ are convex sets of distributions. Lemma
1.3 in Appendix I uses a separating hyperplane argument to show $\mathcal{D} = \mathcal{C}$. The adversary can therefore simulate
any IID source with distribution in $\mathcal{C}$ and hence $R(D) \geq \tilde{R}(D)$. ■
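The memoryless strategy in the proof can be made concrete: first compute the distribution $\beta$ of the candidate set, then mix the rules $f(\cdot|\mathcal{V})$ to obtain the simulated output distribution of (20). The sketch below assumes the subsources are independent of each other (the paper does not require this; it is assumed here only so that $\beta$ can be computed from the marginals), and the function names are hypothetical.

```python
import itertools
from collections import defaultdict

def candidate_set_pmf(marginals):
    """beta(V) = P({x_1, ..., x_m} = V), assuming independent subsources."""
    beta = defaultdict(float)
    alphabet = range(len(marginals[0]))
    for symbols in itertools.product(alphabet, repeat=len(marginals)):
        prob = 1.0
        for l, x in enumerate(symbols):
            prob *= marginals[l][x]
        beta[frozenset(symbols)] += prob
    return dict(beta)

def simulated_distribution(beta, rule, alphabet_size):
    """p(x) = sum_V beta(V) f(x | V); rule[V] maps each x in V to f(x | V)."""
    p = [0.0] * alphabet_size
    for V, b in beta.items():
        for x, fx in rule[V].items():
            p[x] += b * fx
    return p

# Two independent binary subsources with P(1) = 1/4 and P(1) = 1/3.
marginals = [[3 / 4, 1 / 4], [2 / 3, 1 / 3]]
beta = candidate_set_pmf(marginals)
# Rule that outputs a 1 whenever the candidate set contains one.
rule = {
    frozenset({0}): {0: 1.0},
    frozenset({1}): {1: 1.0},
    frozenset({0, 1}): {1: 1.0},
}
p_sim = simulated_distribution(beta, rule, 2)
```

In this example the candidate set is $\{0\}$ with probability 1/2, $\{1\}$ with probability 1/12 and $\{0, 1\}$ with probability 5/12, so the rule above makes the AVS output look like a uniform binary source.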

Qualitatively, allowing the switcher to 'cheat' gives access to distributions $p \in \mathcal{C}$ which may not be in $\mathrm{conv}(\mathcal{G})$.
Quantitatively, the conditions placed on the distributions in $\mathcal{C}$ are precisely those that restrict the switcher from
producing symbols that do not occur often enough on average. For example, let $\mathcal{V} = \{1\}$ where $1 \in \mathcal{X}$, and suppose
that the subsources are independent of each other. Then for every $p \in \mathcal{C}$,

$$p(1) \geq \prod_{l=1}^{m} p_l(1). \tag{21}$$

$\prod_{l=1}^{m} p_l(1)$ is the probability that all $m$ subsources produce the letter 1 at a given time. In this case, the switcher
has no option but to output the letter 1, hence any distribution the switcher mimics must have $p(1) \geq \prod_{l=1}^{m} p_l(1)$.
The same logic can be applied to all subsets $\mathcal{V}$ of $\mathcal{X}$.
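Applying this logic to every subset gives a simple membership test for $\mathcal{C}$. The sketch below assumes, as in the example above, that the subsources are independent, so that the probability that all subsources fall in $\mathcal{V}$ factors into a product; it also assumes the condition for each $\mathcal{V}$ reads "the mass $p$ puts on $\mathcal{V}$ is at least the probability that every subsource outputs a letter in $\mathcal{V}$". The function names are illustrative.

```python
import itertools

def prob_all_in(marginals, V):
    """P(every subsource outputs a letter in V), assuming independent subsources."""
    prob = 1.0
    for p_l in marginals:
        prob *= sum(p_l[x] for x in V)
    return prob

def in_C(p, marginals, tol=1e-12):
    """Check the subset constraints for every nonempty V; enumeration is exponential
    in the alphabet size, so this is only meant for small alphabets."""
    alphabet = range(len(p))
    for r in range(1, len(p) + 1):
        for V in itertools.combinations(alphabet, r):
            if sum(p[x] for x in V) < prob_all_in(marginals, V) - tol:
                return False
    return True

marginals = [[3 / 4, 1 / 4], [2 / 3, 1 / 3]]
```

For these two binary subsources the constraints reduce to $p(1) \geq 1/12$ and $p(0) \geq 1/2$, so `in_C([0.5, 0.5], marginals)` holds while `in_C([0.4, 0.6], marginals)` does not.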

IV. Noisy observations of subsource realizations 

A natural extension of the AVS model is to consider the case when the adversary has noisy access to subsource 
realizations through a discrete memoryless channel before pointing the camera. Since the subsource probability 
distributions are already known, this model is equivalent to one in which the switcher observes a state noiselessly. 
Conditioned on the state, the m subsources output symbols independent of the past according to a conditional 
distribution. This model is depicted in Figure 2.

The overall AVS now comprises a 'state generator' and a 'symbol generator' that outputs $m$ symbols at a
time. The state generator produces the state $t_k$ at time $k$ from a finite set $\mathcal{T}$. We assume the states are generated
IID across time with distribution $\alpha(t)$. At time $k$, the symbol generator outputs $(x_{1,k}, \ldots, x_{m,k})$ according to
$P(x_{1,k}, \ldots, x_{m,k}|t_k)$. This model allows for correlation among the subsources at a fixed time. Let $p_l(\cdot|t)$, $l =
1, \ldots, m$, be the marginals of this joint distribution so that conditioned on $t_k$, $x_{l,k}$ has marginal distribution $p_l(\cdot|t_k)$.
For a $t \in \mathcal{T}$, let $\mathcal{G}(t) = \mathrm{conv}(p_1(\cdot|t), \ldots, p_m(\cdot|t))$.

The switcher can observe states either with full lookahead or 1-step lookahead, but these two cases will once
again have the same rate-distortion function when the switcher is an adversary. So assume that at time $k$, the
switcher chooses the switch position $s_k$ with knowledge of $t^n, x_1^{k-1}, \ldots, x_m^{k-1}$. The non-cheating and cheating
switcher can be recovered as special cases of this model. If the conditional distributions $p_l(x|t)$ do not depend
on $t$, the non-cheating switcher is recovered. The cheating switcher is recovered by setting $\mathcal{T} = \mathcal{X}^m$ and letting
$p_l(x|t) = 1(x = t(l))$ where the state $t$ is an $m$-dimensional vector consisting of the outputs of each subsource.
With this setup, we have the following extension of Theorem 3.1.

Fig. 2. A model of an AVS encompassing both cheating and non-cheating switchers. Additionally, this model allows for noisy observations
of subsource realizations by the switcher.

Theorem 4.1: For the AVS problem of Figure 2, where the adversary has access to the states either with 1-step
lookahead or full lookahead,

$$R(D) = \max_{p \in \mathcal{D}_{states}} R(p, D), \tag{22}$$

where

$$\mathcal{D}_{states} = \left\{ p \in \mathcal{P}(\mathcal{X}) : p(\cdot) = \sum_{t \in \mathcal{T}} \alpha(t) f(\cdot|t),\ \ f(\cdot|t) \in \mathcal{G}(t) \ \ \forall t \in \mathcal{T} \right\}. \tag{23}$$

Proof: See Appendix II. ■

One can see that in the case of the cheating switcher of the previous section, the set $\mathcal{D}$ of equation (20) equates
directly with $\mathcal{D}_{states}$ of equation (23). In that sense, from the switcher's point of view, $\mathcal{D}$ is a more natural description
of the set of distributions that can be simulated than $\mathcal{C}$. Again, computing $R(D)$ in (22) falls into the discussion
of Section VII.
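For a binary alphabet, the set of attainable distributions in the state model reduces to an interval of values of $p(1)$, which makes the special cases easy to check. The sketch below assumes that the attainable set consists of the mixtures $\sum_t \alpha(t) f(\cdot|t)$ with each $f(\cdot|t)$ in the convex hull $\mathcal{G}(t)$ of the conditional marginals; under that reading, the extreme values of $p(1)$ are obtained by picking the smallest or the largest conditional probability of a 1 in every state. The function name and the data layout are hypothetical.

```python
def states_interval(alpha, cond_prob_one):
    """Range of p(1) attainable in the state model for a binary alphabet.

    alpha[t] is the state probability; cond_prob_one[t][l] is p_l(1 | t).
    """
    lo = sum(a * min(cond_prob_one[t]) for t, a in alpha.items())
    hi = sum(a * max(cond_prob_one[t]) for t, a in alpha.items())
    return lo, hi

# Non-cheating switcher: a single uninformative state.
plain = states_interval({"t0": 1.0}, {"t0": [1 / 4, 1 / 3]})

# Cheating switcher: the state is the pair of subsource outputs (independent
# Bernoulli(1/4) and Bernoulli(1/3) are assumed here), so p_l(1 | t) = t[l].
p1, p2 = [3 / 4, 1 / 4], [2 / 3, 1 / 3]
alpha = {(a, b): p1[a] * p2[b] for a in (0, 1) for b in (0, 1)}
cond = {t: [float(t[0]), float(t[1])] for t in alpha}
cheat = states_interval(alpha, cond)
```

The uninformative state recovers the interval $[1/4, 1/3]$, that is, the convex hull of the two subsources, while the fully informative state gives $[1/12, 1/2]$, matching the constraints that define the cheating switcher's set in this example.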



V. The Helpful Switcher 

In general, the active-source may be acting in such a way that optimizes its own objectives. When its objective 
is to output a source sequence that is not well represented by the codebook, we arrive at the traditional adversarial 
setting considered above. The objective of the switcher, however, may vary from adversarial to agnostic to helpful. 
In this section, we consider the helpful cheating switcher. The model is as follows: 

• The coder chooses a codebook that is made known to the switcher. 

• The switcher chooses a strategy to help the coder achieve distortion D on average with the minimum number 
of codewords. We consider the cases where the switcher has full lookahead or 1-step lookahead. 

As opposed to the adversarial setting, a rate R is now achievable at distortion D if there exist switcher strategies and 
codebooks for each n with expected distortion at most D and the rates of the codebooks tend to R. The following 
theorem establishes R{D) if the cheating switcher has full lookahead. 

Theorem 5.1: Let $\mathcal{X}^* = \{\mathcal{V} \subseteq \mathcal{X} : \mathcal{V} \neq \emptyset, |\mathcal{V}| \leq m\}$. Let $\rho : \mathcal{X}^* \times \hat{\mathcal{X}} \to [0, d^*]$ be defined by

$$\rho(\mathcal{V}, \hat{x}) = \min_{x \in \mathcal{V}} d(x, \hat{x}). \tag{24}$$

Let $\mathcal{V}_k = \{x_{1,k}, \ldots, x_{m,k}\}$ for all $k$. Note that $\mathcal{V}_i$, $i = 1, 2, \ldots$ is a sequence of IID random variables with distribution
$\beta(\mathcal{V}) = P(\{x_{1,1}, \ldots, x_{m,1}\} = \mathcal{V})$. Let $R^*(\beta, D)$ be the rate-distortion function for the IID source with distribution
$\beta$ at distortion $D$ with respect to the distortion measure $\rho(\cdot, \cdot)$. For the helpful cheating switcher with full lookahead,

$$R(D) = R^*(\beta, D). \tag{25}$$
Proof: Rate-distortion problems are essentially covering problems, so we equate the rate-distortion problem
for the helpful switcher with the classical covering problem for the observed sets $\mathcal{V}_i$. If the switcher is helpful, has
full lookahead, and knowledge of the codebook, the problem of designing the codebook is equivalent to designing
the switcher strategy and codebook jointly. At each time $k$, the switcher observes a candidate set $\mathcal{V}_k$ and must select
an element from $\mathcal{V}_k$. For any particular reconstruction codeword $\hat{x}^n$, and a string of candidate sets $(\mathcal{V}_1, \mathcal{V}_2, \ldots, \mathcal{V}_n)$,
the switcher can at best output a sequence $x^n$ such that

$$d_n(x^n, \hat{x}^n) = \frac{1}{n} \sum_{k=1}^{n} \rho(\mathcal{V}_k, \hat{x}_k). \tag{26}$$

Hence, for a codebook $\mathcal{B}$, the helpful switcher with full lookahead can select switch positions to output $x^n$ such
that

$$d_n(x^n; \mathcal{B}) = \min_{\hat{x}^n \in \mathcal{B}} \frac{1}{n} \sum_{k=1}^{n} \rho(\mathcal{V}_k, \hat{x}_k). \tag{27}$$

Therefore, for the helpful switcher, the problem of covering the $\mathcal{X}$ space with respect to the distortion measure
$d(\cdot, \cdot)$ now becomes one of covering the $\mathcal{X}^*$ space with respect to the distortion measure $\rho(\cdot, \cdot)$. ■
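The alignment step in this proof can be sketched in a few lines: given the whole string of candidate sets and the codebook, the helpful switcher picks the codeword with the smallest total set distortion and then, at each time, outputs the element of the candidate set closest to that codeword. The code below is an illustration with hypothetical names, using integer-indexed alphabets and a distortion table `d[x][x_hat]`.

```python
def rho(V, x_hat, d):
    """Set distortion: distortion of the best element of V against x_hat, as in (24)."""
    return min(d[x][x_hat] for x in V)

def helpful_output(candidate_sets, codebook, d):
    """Codeword, output sequence and distortion chosen by a helpful switcher
    with full lookahead and knowledge of the codebook."""
    n = len(candidate_sets)
    # Codeword minimizing the total set distortion over the block.
    best = min(codebook,
               key=lambda c: sum(rho(V, c[k], d) for k, V in enumerate(candidate_sets)))
    # At each time, output the candidate closest to the chosen codeword.
    output = [min(sorted(V), key=lambda x: d[x][best[k]])
              for k, V in enumerate(candidate_sets)]
    distortion = sum(d[x][best[k]] for k, x in enumerate(output)) / n
    return best, output, distortion

hamming = [[0, 1], [1, 0]]
sets = [{0, 1}, {1}, {1}, {0}]
codeword, out, dist = helpful_output(sets, [(0, 0, 0, 0), (1, 1, 1, 1)], hamming)
```

Here the all-ones codeword is chosen, the switcher outputs `[1, 1, 1, 0]`, and the distortion is 1/4. Note that the choice of codeword depends on the entire block of candidate sets, which is why the argument needs full lookahead.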

Remarks: 

• Computing $R(D)$ in (25) can be done by the Blahut-Arimoto algorithm [20].

• In the above proof, full lookahead was required in order for the switcher to align the entire output word of
the source with the minimum distortion reconstruction codeword as a whole. This process cannot be done
with 1-step lookahead and so the $R(D)$ function for a helpful switcher with 1-step lookahead remains an open
question, but we have the following corollary of Theorems 3.1 and 5.1.

Corollary 5.1: For the helpful switcher with 1-step lookahead,

$$R^*(\beta, D) \le R(D) \le \min_{p \in \mathcal{C}} R(p, D). \qquad (28)$$

Proof: If the switcher has at least 1-step lookahead, it immediately follows from the proof of Theorem 3.1
that $R(D) \le \min_{p \in \mathcal{C}} R(p, D)$. The question is whether or not any lower rate is achievable. We can make the
helpful switcher with 1-step lookahead more powerful by giving it $n$-step lookahead, which yields the lower bound
$R^*(\beta, D)$. ■

An example in Section VI-B shows that in general, we have the strict inequality $R^*(\beta, D) < \min_{p \in \mathcal{C}} R(p, D)$.

One can also investigate the helpful switcher problem when the switcher has access to noisy or partial observations
as in Section IV. This problem has the added flavor of remote source coding because the switcher can be thought
of as an extension of the coder and observes data correlated with the source to be encoded. However, the switcher
has the additional capability of choosing the subsource that must be encoded. For now, this problem is open and
we can only say that $R(D) \le \min_{p \in \mathcal{D}_{states}} R(p, D)$.

VI. Examples 

We illustrate the results with several simple examples using binary alphabets and Hamming distortion, i.e. $\mathcal{X} =
\hat{\mathcal{X}} = \{0, 1\}$ and $d(x, \hat{x}) = 1(x \neq \hat{x})$. Recall that the rate-distortion function of an IID binary source with distribution
$(1-p, p)$, $p \in [0, \frac{1}{2}]$, is

$$R((1-p, p), D) = \begin{cases} h_b(p) - h_b(D), & D \in [0, p] \\ 0, & D > p, \end{cases} \qquad (29)$$

where $h_b(p)$ is the binary entropy function (in bits for this section).
A. Bernoulli 1/4 and 1/3 sources

Let $m = 2$ so the switcher has access to two IID Bernoulli subsources. Subsource 1 outputs 1 with probability
1/4 and subsource 2 outputs 1 with probability 1/3, so $p_1 = (3/4, 1/4)$ and $p_2 = (2/3, 1/3)$. First, we consider
the switcher as an adversary. Figure 3 shows this example in the traditional strictly causal setting of [5], where the
switcher gets outputs of the source only after the switch position has been decided. Figure 4 shows the AVS in the
noncausal setting, where the switcher has the subsource realizations before choosing the switch position.



[Figure: block diagram. Subsources $B(1/4)$ and $B(1/3)$ emit $x_{1,1}, x_{1,2}, x_{1,3}, \ldots$ and $x_{2,1}, x_{2,2}, x_{2,3}, \ldots$; a switch selection $s_1, s_2, \ldots$ picks the output.]

Fig. 3. The adversary chooses the switch position with knowledge only of the past AVS outputs. For Hamming distortion, the rate-distortion
function is $R(D) = h_b(1/3) - h_b(D)$ for $D \in [0, 1/3]$.



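[Figure: block diagram. Subsources $B(1/4)$ and $B(1/3)$ emit $x_{1,1}, x_{1,2}, x_{1,3}, \ldots$ and $x_{2,1}, x_{2,2}, x_{2,3}, \ldots$; the switch selection $s_1, s_2, \ldots$ sees both and produces the output $x_1, x_2, \ldots$]

Fig. 4. The adversary chooses the switch position with knowledge of both subsource realizations. For Hamming distortion, the rate-distortion
function is $R(D) = 1 - h_b(D)$ for $D \in [0, 1/2]$.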



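For any time $k$,

$$P(x_{1,k} = x_{2,k} = 0) = \frac{3}{4} \cdot \frac{2}{3} = \frac{1}{2} \qquad (30)$$
$$P(x_{1,k} = x_{2,k} = 1) = \frac{1}{4} \cdot \frac{1}{3} = \frac{1}{12} \qquad (31)$$
$$P(\{x_{1,k}, x_{2,k}\} = \{0, 1\}) = 1 - \frac{1}{2} - \frac{1}{12} = \frac{5}{12}. \qquad (32)$$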



If the switcher is allowed 1-step lookahead and has the option of choosing either 0 or 1, suppose the switcher
chooses 1 with probability $f_1$. The coder then sees an IID binary source with a probability of a 1 occurring being
equal to:

$$P(1) = \frac{1}{12} + \frac{5}{12} f_1. \qquad (33)$$
[Figure: the interval of $P(x = 1)$ with marks at 1/12, 1/4, 1/3 and 1/2; conv($\mathcal{G}$) spans $[1/4, 1/3]$ and $\mathcal{C}$ spans $[1/12, 1/2]$.]

Fig. 5. The binary distributions the switcher can mimic. conv($\mathcal{G}$) is the set of distributions the switcher can mimic with causal access to
subsource realizations, and $\mathcal{C}$ is the set attainable with noncausal access.



[Figure: block diagram. Subsources $B(1/4)$ and $B(1/3)$ emit $x_{1,1}, x_{1,2}, x_{1,3}, \ldots$ and $x_{2,1}, x_{2,2}, x_{2,3}, \ldots$; the switch selection $s_1, s_2, \ldots$ produces the output $x_1, x_2, \ldots$]

Fig. 6. The adversary observes the mod-2 sum of the two subsources, a Bernoulli 1/3 subsource and a Bernoulli 1/4 subsource. For
Hamming distortion, the rate-distortion function is $R(D) = h_b(1/3) - h_b(D)$ for $D \in [0, 1/3]$.



By using $f_1$ as a parameter, the switcher can produce 1's with any probability between 1/12 and 1/2. The
attainable distributions are shown in Figure 5. The switcher with lookahead can simulate a significantly larger
set of distributions than the causal switcher, which is restricted to outputting 1's with probability in $[1/4, 1/3]$.
Thus, for the strictly causal switcher, $R(D) = h_b(1/3) - h_b(D)$ for $D \in [0, 1/3]$, and for the switcher with 1-step
or full lookahead, $R(D) = 1 - h_b(D)$ for $D \in [0, 1/2]$.
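The comparison above can be checked numerically. The following is a minimal sketch, assuming the binary Hamming formula (29) and the probabilities (30)-(33); the function names `hb`, `rd_binary`, `prob_one` and `worst_case_rate` are illustrative and not from the paper.

```python
import math

def hb(p):
    """Binary entropy in bits, with hb(0) = hb(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def rd_binary(p1, D):
    """R(D) in bits of an IID binary source with P(1) = p1, Hamming distortion (eq. 29)."""
    p = min(p1, 1 - p1)
    return hb(p) - hb(D) if D < p else 0.0

def prob_one(f1, a=0.25, b=1.0 / 3.0):
    """P(output = 1) when the switcher emits 1 with probability f1 whenever {0, 1} is available (eq. 33)."""
    both_one = a * b                       # both subsources output 1
    mixed = a * (1 - b) + (1 - a) * b      # exactly one subsource outputs 1
    return both_one + mixed * f1

def worst_case_rate(lo, hi, D):
    """max of R(p, D) over P(1) in [lo, hi]; the binary R(p, D) peaks at the point nearest 1/2."""
    p_star = min(max(0.5, lo), hi)
    return rd_binary(p_star, D)

# Attainable range with lookahead, and the two adversarial rates at D = 0.1
lookahead_range = (prob_one(0.0), prob_one(1.0))
rate_causal = worst_case_rate(0.25, 1.0 / 3.0, 0.1)
rate_lookahead = worst_case_rate(lookahead_range[0], lookahead_range[1], 0.1)
```

Here `rate_causal` corresponds to $h_b(1/3) - h_b(D)$ and `rate_lookahead` to $1 - h_b(D)$, both evaluated at $D = 0.1$.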

We now look at several variations of this example to illustrate the utility of noisy or partial observations of the
subsources for the switcher. In the first variation, shown in Figure 6, the switcher observes the mod-2 sum of the
two subsources. Theorem 4.1 then implies that $R(D) = h_b(1/3) - h_b(D)$ for $D \in [0, 1/3]$. Hence, the mod-2 sum
of these two subsources is useless to the switcher in deciding the switch position. This is intuitively clear from the
symmetry of the mod-2 sum. If $t = 0$, either both subsources are 0 or both subsources are 1, so the switch position
doesn't matter in this state. If $t = 1$, one of the subsources has output 1 and the other has output 0, but because
of the symmetry of the mod-2 function, the switcher's prior as to which subsource output the 1 does not change
and it remains that subsource 2 was more likely to have output the 1.

In the second variation, shown in Figure 7, the switcher observes the second subsource directly but not the
first, so $t_k = x_{2,k}$ for all $k$. Using Theorem 4.1 again, it can be deduced that in this case $R(D) = 1 - h_b(D)$
for $D \in [0, 1/2]$. This is also true if $t_k = x_{1,k}$ for all $k$, so observing just one of the subsources noncausally is
as beneficial to the switcher as observing both subsources noncausally. This is clear in this example because the
switcher is attempting to output as many 1's as possible. If $t = 1$, the switcher will set the switch position to 2 and
if $t = 0$, the switcher will set the switch position to 1 as there is still a chance that the first subsource outputs a 1.

For this example, the helpful cheater with 1-step lookahead has a rate-distortion function that is upper bounded
by $h_b(1/12) - h_b(D)$ for $D \in [0, 1/12]$. The rate-distortion function for the helpful cheater with full lookahead can
be computed from Theorem 5.1. In Figure 8, the rate-distortion function is plotted for the situations discussed so
far. In an active sensing situation, we see that there can be a large gap between the required rates for adversarially
modelled active sensors and sensors which have been jointly optimized with the coding system.
[Figure: block diagram. Subsources $B(1/4)$ and $B(1/3)$ emit $x_{1,1}, x_{1,2}, x_{1,3}, \ldots$ and $x_{2,1}, x_{2,2}, x_{2,3}, \ldots$; the switch selection $s_1, s_2, \ldots$ observes $t_1, t_2, \ldots$ and produces the output $x_1, x_2, \ldots$]

Fig. 7. The adversary observes the second subsource perfectly, but does not observe the first subsource. For Hamming distortion, the
rate-distortion function is $R(D) = 1 - h_b(D)$ for $D \in [0, 1/2]$.



[Figure: plot titled "R(D) for Bernoulli 1/3 and 1/4 example".]

Fig. 8. $R(D)$ for the cheating switcher and the non-cheating switcher. Also, the rate-distortion function for the examples of Figures 6 and 7.



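[Figure: block diagram. Subsources $B(1/4)$ and $B(1/3)$ emit $x_{1,1}, x_{1,2}, x_{1,3}, \ldots$ and $x_{2,1}, x_{2,2}, x_{2,3}, \ldots$; the second subsource passes through a BSC($\delta$) to give the observations $t_1, t_2, \ldots$ used for the switch selection $s_1, s_2, \ldots$]

Fig. 9. The adversary observes the second subsource transmitted over a binary symmetric channel with crossover probability $\delta$. For
Hamming distortion, the rate-distortion function is $R(D) = h_b(1/3) - h_b(D)$ for $D \in [0, 1/3]$ if $\delta \in [2/5, 1/2]$. If $\delta \in [0, 2/5)$,
$R(D) = h_b(1/2 - 5\delta/12) - h_b(D)$ for $D \in [0, 1/2 - 5\delta/12]$.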




Fig. 10. $R(D)$ as a function of the noisy observation crossover probability $\delta$ for two different distortions for the example of Figure 9.



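Finally, in Figure 9, an adversarial switcher observes the second subsource through a binary symmetric channel
with crossover probability $\delta \in [0, 1/2]$. Applying Theorem 4.1 again, it can be shown that if $\delta \in [0, 2/5]$,

$$R(D) = h_b\left(\frac{1}{2} - \frac{5\delta}{12}\right) - h_b(D), \quad D \in \left[0, \frac{1}{2} - \frac{5\delta}{12}\right], \qquad (34)$$

and if $\delta \in [2/5, 1/2]$,

$$R(D) = h_b\left(\frac{1}{3}\right) - h_b(D), \quad D \in \left[0, \frac{1}{3}\right]. \qquad (35)$$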



Here, increasing $\delta$ decreases the switcher's knowledge of the subsource realizations. Somewhat surprisingly, the
utility of the observation is exhausted at $\delta = 2/5$, even before the state and observation are completely independent
at $\delta = 1/2$. This can be explained through the switcher's a posteriori belief that the second subsource output was a 1
given the state. If the switcher observes $t = 1$ and $\delta < 1/2$, $p(x_{2,k} = 1 | t_k = 1) > 1/3 > 1/4$, so the switch position
will be set to 2. When the switcher observes $t = 0$, if $\delta < 2/5$, $p(x_{2,k} = 1 | t_k = 0) < 1/4$, so the switch will be
set to position 1. However, if $\delta > 2/5$, $p(x_{2,k} = 1 | t_k = 0) > 1/4$, so the switch position will be set to 2 even if
$t = 0$ because the switcher's a posteriori belief is that the second subsource is still more likely to have output a 1
than the first subsource. Figure 10 shows $R(D)$ for this example as a function of $\delta$ for two values of $D$.
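The threshold at $\delta = 2/5$ can be reproduced with exact arithmetic. The sketch below assumes the setup of Figure 9 and an adversary that, for each observation $t$, selects the subsource with the larger posterior probability of emitting a 1; the function names are illustrative only.

```python
from fractions import Fraction

def posterior_x2_is_one(t, delta, p2=Fraction(1, 3)):
    """P(x2 = 1 | t), where t is x2 observed through a BSC with crossover probability delta."""
    like_one = (1 - delta) if t == 1 else delta      # P(t | x2 = 1)
    like_zero = delta if t == 1 else (1 - delta)     # P(t | x2 = 0)
    return p2 * like_one / (p2 * like_one + (1 - p2) * like_zero)

def adversary_prob_one(delta, p1=Fraction(1, 4), p2=Fraction(1, 3)):
    """P(output = 1) when, for each t, the switcher picks the subsource more likely to emit a 1."""
    total = Fraction(0)
    for t in (0, 1):
        like_one = (1 - delta) if t == 1 else delta
        like_zero = delta if t == 1 else (1 - delta)
        p_t = p2 * like_one + (1 - p2) * like_zero   # P(t)
        # Subsource 1 is independent of t, so its chance of a 1 stays at p1.
        total += p_t * max(p1, posterior_x2_is_one(t, delta, p2))
    return total
```

For $\delta < 2/5$ this evaluates to $1/2 - 5\delta/12$, and for $\delta \ge 2/5$ it stays at $1/3$, matching the arguments of $h_b$ in (34) and (35).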
[Figure: plot titled "R(D) for two Bernoulli 1/2 sources".]

Fig. 11. The $R(D)$ function for a helpful switcher with full lookahead. For 1-step lookahead, the upper bound is shown.

B. Two Bernoulli 1/2 subsources 

Suppose $m = 2$, and both subsources are Bernoulli 1/2 IID processes. For this example, the rate-distortion
function is $R(D) = 1 - h_b(D)$ for $D \in [0, 1/2]$ whether the adversarial switcher is strictly causal, causal or
noncausal. When the helpful switcher has 1-step lookahead, $R(D) \le R_U(D) = h_b(1/4) - h_b(D)$ for $D \in [0, 1/4]$.
One can also think of this upper bound as being the rate-distortion function for the helpful switcher with 1-step
lookahead that is restricted to using memoryless, time-invariant rules. Using Theorem 9.4.1 of [21], one can show
that when the switcher has full lookahead,

$$R(D) = R^*(\beta, D) = \frac{1}{2}\left[1 - h_b(2D)\right], \quad D \in [0, 1/4]. \qquad (36)$$

The plot of these functions in Figure 11 shows that the rate-distortion function can be significantly reduced if the
helpful switcher is allowed to observe the entire block of subsource realizations. It is also interesting to note how the
switcher with full lookahead helps the coder achieve a rate of $R^*(\beta, D)$. In this example $\mathcal{X}^* = \{\{0\}, \{1\}, \{0, 1\}\}$,
$\rho(\{0\}, \hat{x}) = 1(0 \neq \hat{x})$, $\rho(\{1\}, \hat{x}) = 1(1 \neq \hat{x})$, $\rho(\{0, 1\}, \hat{x}) = 0$ and $\beta = (1/4, 1/4, 1/2)$. The $R^*(\beta, D)$ achieving
distribution on $\hat{\mathcal{X}}$ is $(1/2, 1/2)$, but $R^*(\beta, D) < 1 - h_b(D)$. The coder is attempting to cover strings with types
near $(1/2, 1/2)$ but with far fewer codewords than are needed to do so. This problem is circumvented through the
aid provided by the switcher in pushing the output of the source inside the Hamming $D$-ball of a codeword. This
is in contrast to the strategy that achieves $R_U(D)$, where the switcher makes the output an IID sequence with as
few 1's as possible and the coder is expected to cover all strings with types near $(3/4, 1/4)$.
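As a sanity check on (36), the set-valued source of this example can be fed to a Blahut-Arimoto iteration. This is a minimal sketch of the standard fixed-slope iteration, not the authors' code; the slope parameter `s`, the function names, and the iteration count are assumptions made for illustration.

```python
import math

def hb(p):
    """Binary entropy in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def blahut_arimoto(beta, rho, s, iters=200):
    """Return (D, R in bits) on the R(D) curve at slope parameter s > 0.

    beta[v] is the source pmf and rho[v][j] the distortion between
    source letter v and reconstruction letter j.
    """
    n_src, n_rep = len(beta), len(rho[0])
    q = [1.0 / n_rep] * n_rep                       # reconstruction marginal, uniform start
    for _ in range(iters):
        Q = []                                      # test channel Q(j | v)
        for v in range(n_src):
            w = [q[j] * math.exp(-s * rho[v][j]) for j in range(n_rep)]
            z = sum(w)
            Q.append([wj / z for wj in w])
        # Update q to the marginal induced by the current test channel.
        q = [sum(beta[v] * Q[v][j] for v in range(n_src)) for j in range(n_rep)]
    D = sum(beta[v] * Q[v][j] * rho[v][j] for v in range(n_src) for j in range(n_rep))
    R = sum(beta[v] * Q[v][j] * math.log2(Q[v][j] / q[j])
            for v in range(n_src) for j in range(n_rep))
    return D, R

# Source letters: {0}, {1}, {0,1}; reconstruction letters: 0, 1.
beta = [0.25, 0.25, 0.5]
rho = [[0.0, 1.0], [1.0, 0.0], [0.0, 0.0]]
D, R = blahut_arimoto(beta, rho, s=2.0)
```

Each slope `s` yields one point $(D, R)$ on the curve; for this example the point should satisfy $R = \frac{1}{2}[1 - h_b(2D)]$ up to numerical precision.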

VII. Computing $R(D)$ for an AVS

The $R(D)$ function for an AVS with either causal or noncausal access to the subsource realizations is of the
form

$$R(D) = \max_{p \in \mathcal{Q}} R(p, D), \qquad (37)$$

where $\mathcal{Q}$ is a set of distributions in $\mathcal{P}(\mathcal{X})$. In (11), (19), and (23), $\mathcal{Q}$ is defined by a finite number of linear
inequalities and hence is a polytope. The number of constraints in the definition of $\mathcal{Q}$ is exponential in $|\mathcal{X}|$ or
$|\mathcal{T}|$ when the adversary has something other than strictly causal knowledge. Unfortunately, the problem of finding
$R(D)$ is not a convex program because $R(p, D)$ is not a concave-$\cap$ function of $p$ in general. In fact, $R(p, D)$ may
not even be quasi-concave and may have multiple local maxima with values different from the global maximum,
as shown by Ahlswede [22].

Since standard convex optimization tools are unavailable for this problem, we consider the question of how to
approximate $R(D)$ to within some (provable) precision. That is, for any $\epsilon > 0$, we will consider how to provide an
approximation $R_a(D)$ such that $|R_a(D) - R(D)| \le \epsilon$. Note that for fixed $p$, $R(p, D)$ can be computed efficiently
by the Blahut-Arimoto algorithm to any given precision, say much less than $\epsilon$. Therefore, we assume that $R(p, D)$
can be computed for a fixed $p$ and $D$. We also assume $D \ge D_{\min}(\mathcal{Q})$ since otherwise $R(D) = \infty$. Checking this
condition is a linear program since $\mathcal{Q}$ is a polytope and $D_{\min}(p)$ is linear in $p$.

We will take a 'brute-force' approach to computing $R(D)$. That is, we wish to compute $R(p, D)$ for (finitely)
many $p$ and then maximize over the computed values to yield $R_a(D)$. Since $R(p, D)$ is uniformly continuous
in $(p, D)$ and hence in $p$, it is possible to do this and have $|R_a(D) - R(D)| \le \epsilon$ provided enough distributions
$p$ are 'sampled'. Undoubtedly, there are other algorithms to compute $R(D)$ that likely have better problem-size
dependence. In this section, we are only interested in showing that $R(D)$ can provably be computed to within any
required precision with a finite number of computations.



A. Uniform continuity of $R(p, D)$

The main tool used to show that the rate-distortion function can be approximated is an explicit bound on the
uniform continuity of $R(p, D)$ in terms of $\|p - q\|_1 = \sum_{x \in \mathcal{X}} |p(x) - q(x)|$ for distortion measures that allow
for 0-distortion to be achieved regardless of the source. In [20], a bound on the continuity of the entropy of a
distribution is developed in terms of $\|p - q\|_1$.

Lemma 7.1 ($\ell_1$ bound on continuity of entropy [20]): Let $p$ and $q$ be two probability distributions on $\mathcal{X}$ such
that $\|p - q\|_1 \le 1/2$, then

$$|H(p) - H(q)| \le \|p - q\|_1 \ln \frac{|\mathcal{X}|}{\|p - q\|_1}. \qquad (38)$$
In the following lemma, a similar uniform continuity is stated for $R(p, D)$. The proof makes use of Lemma 7.1.

Lemma 7.2 (Uniform continuity of $R(p, D)$): Let $d : \mathcal{X} \times \hat{\mathcal{X}} \to [0, d^*]$ be a distortion function, and let $\tilde{d}$ be the
minimum nonzero distortion from (1). Also, assume that for each $x \in \mathcal{X}$, there is an $\hat{x}_0(x) \in \hat{\mathcal{X}}$ such that
$d(x, \hat{x}_0(x)) = 0$. Then, for $p, q \in \mathcal{P}(\mathcal{X})$ with $\|p - q\|_1 \le \tilde{d}/(4d^*)$, for any $D \ge 0$,

$$|R(p, D) - R(q, D)| \le \frac{7 d^*}{\tilde{d}} \|p - q\|_1 \ln \frac{|\mathcal{X}||\hat{\mathcal{X}}|}{\|p - q\|_1}. \qquad (39)$$

Proof: See Appendix III. ■

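The restriction that $d(x, \cdot)$ has at least one zero for every $x$ can be relaxed if we are careful about recognizing
when $R(p, D)$ is infinite. For an arbitrary distortion measure $d : \mathcal{X} \times \hat{\mathcal{X}} \to [0, d^*]$, define

$$d_0(x, \hat{x}) = d(x, \hat{x}) - \min_{\tilde{x} \in \hat{\mathcal{X}}} d(x, \tilde{x}). \qquad (40)$$

Now let $d_0^* = \max_{x, \hat{x}} d_0(x, \hat{x})$ and $\tilde{d}_0 = \min_{(x, \hat{x}) : d_0(x, \hat{x}) > 0} d_0(x, \hat{x})$. We have defined $d_0(x, \hat{x})$ so that Lemma 7.2
applies, so we can prove the following lemma.

Lemma 7.3: Let $p, q \in \mathcal{P}(\mathcal{X})$ and let $D \ge \max(D_{\min}(p), D_{\min}(q))$. If $\|p - q\|_1 \le \tilde{d}_0/(4d^*)$,

$$|R(p, D) - R(q, D)| \le \frac{11 d^*}{\tilde{d}_0} \|p - q\|_1 \ln \frac{|\mathcal{X}||\hat{\mathcal{X}}|}{\|p - q\|_1}. \qquad (41)$$

Proof: See Appendix IV. ■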

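As $\|p - q\|_1$ goes to 0, $-\ln \|p - q\|_1$ goes to infinity slowly and it can be shown that for any $\delta \in (0, 1)$ and
$\gamma \in [0, 1/2]$,

$$\gamma \ln \frac{|\mathcal{X}||\hat{\mathcal{X}}|}{\gamma} \le \frac{(|\mathcal{X}||\hat{\mathcal{X}}|)^{\delta}}{e \delta} \gamma^{1 - \delta}. \qquad (42)$$

In the sequel, we let $f(\gamma) = \gamma \ln \frac{|\mathcal{X}||\hat{\mathcal{X}}|}{\gamma}$ for $\gamma \in [0, 1/2]$ with $f(0) = 0$ by continuity. It can be checked that $f$
is strictly monotonically increasing and continuous on $[0, 1/2]$ and hence has an inverse function $g : f([0, 1/2]) \to
[0, 1/2]$, i.e. $g(f(\gamma)) = \gamma$ for all $\gamma \in [0, 1/2]$. Note that $g$ is not expressible in a simple 'closed-form', but can be
computed numerically.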
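One simple way to evaluate $g$ numerically is bisection, since $f$ is continuous and strictly increasing on $[0, 1/2]$. The sketch below is an illustration under that assumption; the argument `c` stands for the product $|\mathcal{X}||\hat{\mathcal{X}}|$ and the tolerance is an arbitrary choice.

```python
import math

def f(gamma, c):
    """f(gamma) = gamma * ln(c / gamma), with f(0) = 0 by continuity; c plays the role of |X||X_hat|."""
    return 0.0 if gamma == 0.0 else gamma * math.log(c / gamma)

def g(y, c, tol=1e-12):
    """Inverse of f on [0, 1/2], found by bisection."""
    if not 0.0 <= y <= f(0.5, c):
        raise ValueError("y must lie in f([0, 1/2])")
    lo, hi = 0.0, 0.5
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid, c) < y:
            lo = mid      # f is increasing, so the root lies to the right
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For binary source and reconstruction alphabets, `c = 4`, and `g(f(0.1, 4), 4)` recovers 0.1 to within the tolerance.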
B. A bound on the number of distributions to sample

Returning to the problem of computing $R(D)$ in equation (37), consider the following simple algorithm. Without
loss of generality, assume $\mathcal{X} = \{1, 2, \ldots, |\mathcal{X}|\}$. Let $\gamma \in (0, 1)$ and let $\gamma \mathbb{Z}^{|\mathcal{X}| - 1}$ be the $|\mathcal{X}| - 1$ dimensional integer
lattice scaled by $\gamma$. Let $\tilde{\mathcal{O}} = [0, 1]^{|\mathcal{X}| - 1} \cap \gamma \mathbb{Z}^{|\mathcal{X}| - 1}$. Now, define

$$\mathcal{O} = \left\{ q \in \mathcal{P}(\mathcal{X}) : \exists\, \tilde{q} \in \tilde{\mathcal{O}} \text{ s.t. } q(i) = \tilde{q}(i),\ i = 1, \ldots, |\mathcal{X}| - 1,\ q(|\mathcal{X}|) = 1 - \sum_{i=1}^{|\mathcal{X}| - 1} \tilde{q}(i) \ge 0 \right\}. \qquad (43)$$

In words, sample the $|\mathcal{X}| - 1$ dimensional unit cube, $[0, 1]^{|\mathcal{X}| - 1}$, uniformly with points from a scaled integer
lattice. Embed these points in $\mathbb{R}^{|\mathcal{X}|}$ by assigning the last value of the new vector to be 1 minus the sum of the
values in the original point. If this last value is non-negative, the new point is a distribution in $\mathcal{P}(\mathcal{X})$. The algorithm
to compute $R_a(D)$ is then one where we compute $R(q, D)$ for distributions $q \in \mathcal{O}$ that are also in or close enough
to $\mathcal{Q}$.

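1) Fix a $q \in \mathcal{O}$. If $\min_{p \in \mathcal{Q}} \|p - q\|_1 \le 2|\mathcal{X}|\gamma$, compute $R(q, D)$, otherwise do not compute $R(q, D)$. Repeat for
all $q \in \mathcal{O}$.

2) Let $R_a(D)$ be the maximum of the computed values of $R(q, D)$, i.e.

$$R_a(D) = \max \left\{ R(q, D) : q \in \mathcal{O},\ \min_{p \in \mathcal{Q}} \|p - q\|_1 \le 2|\mathcal{X}|\gamma \right\}. \qquad (44)$$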
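The two steps above can be sketched as follows. This is an illustrative implementation, not the authors' code: `sample_distributions` enumerates $\mathcal{O}$ from (43) for any alphabet size, while `approx_max_rate` specializes step 1 to a binary alphabet with Hamming distortion and a set $\mathcal{Q}$ of the form $\{P(1) \in [lo, hi]\}$, for which the $\ell_1$ distance check needs no linear program. All function names are assumptions.

```python
import itertools
import math
from fractions import Fraction

def hb(p):
    """Binary entropy in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def rd_binary(p1, D):
    """R(D) in bits for an IID binary source with P(1) = p1 under Hamming distortion."""
    p = min(p1, 1 - p1)
    return hb(p) - hb(D) if D < p else 0.0

def sample_distributions(alphabet_size, gamma):
    """Enumerate the set O of (43): lattice points of the unit cube, completed to a pmf."""
    gamma = Fraction(gamma)
    steps = int(1 / gamma)                      # largest k with k * gamma <= 1
    for ks in itertools.product(range(steps + 1), repeat=alphabet_size - 1):
        head = [k * gamma for k in ks]
        last = 1 - sum(head)
        if last >= 0:                           # keep only points that are distributions
            yield tuple(head) + (last,)

def approx_max_rate(lo, hi, D, gamma):
    """R_a(D) of (44) for binary Q = {P(1) in [lo, hi]} with Hamming distortion."""
    slack = 2 * 2 * Fraction(gamma)             # 2 |X| gamma with |X| = 2
    best = 0.0
    for q0, q1 in sample_distributions(2, gamma):
        # For binary pmfs, the l1 distance to Q is twice the distance of q1 from [lo, hi].
        dist = 2 * max(lo - q1, q1 - hi, 0)
        if dist <= slack:
            best = max(best, rd_binary(float(q1), D))
    return best
```

For the noncausal example of Section VI-A, `approx_max_rate(Fraction(1, 12), Fraction(1, 2), 0.1, Fraction(1, 100))` returns $1 - h_b(0.1)$ because the lattice contains $P(1) = 1/2$ exactly. For a general polytope $\mathcal{Q}$, the distance check would have to be replaced by a linear program.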

Checking the condition $\min_{p \in \mathcal{Q}} \|p - q\|_1 \le 2|\mathcal{X}|\gamma$ is essentially a linear program, so it can be efficiently solved.
By setting $\gamma$ according to the accuracy $\epsilon > 0$ we want, we get the following result.

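Theorem 7.1: The preceding algorithm computes an approximation $R_a(D)$ such that $|R_a(D) - R(D)| \le \epsilon$ if

$$\gamma \le \frac{1}{2|\mathcal{X}|}\, g\left(\frac{\epsilon \tilde{d}_0}{11 d^*}\right). \qquad (45)$$

The number of distributions for which $R(q, D)$ is computed to determine $R(D)$ to within accuracy $\epsilon$ is at most*

$$N(\epsilon) \le \left( \frac{2|\mathcal{X}|}{g\left(\frac{\epsilon \tilde{d}_0}{11 d^*}\right)} + 2 \right)^{|\mathcal{X}| - 1}. \qquad (46)$$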



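Proof: The bound on $N(\epsilon)$ is clear because the number of points in $\tilde{\mathcal{O}}$ is at most $(\lceil 1/\gamma \rceil + 1)^{|\mathcal{X}| - 1}$ and every
distribution in $\mathcal{O}$ is associated with one in $\tilde{\mathcal{O}}$, so $|\mathcal{O}| \le |\tilde{\mathcal{O}}|$.

Now, we prove $|R_a(D) - R(D)| \le \epsilon$. For this discussion, we let $\gamma = \frac{1}{2|\mathcal{X}|} g\left(\frac{\epsilon \tilde{d}_0}{11 d^*}\right)$. First, for all $p \in \mathcal{Q}$, there
is a $q \in \mathcal{O}$ with $\|p - q\|_1 \le g\left(\frac{\epsilon \tilde{d}_0}{11 d^*}\right) = 2|\mathcal{X}|\gamma$. To see this, let $\tilde{q}(i) = \lfloor p(i)/\gamma \rfloor \gamma$ for $i = 1, \ldots, |\mathcal{X}| - 1$. Then $\tilde{q} \in \tilde{\mathcal{O}}$,
and we let $q(i) = \tilde{q}(i)$ for $i = 1, \ldots, |\mathcal{X}| - 1$. Note that

$$q(|\mathcal{X}|) = 1 - \sum_{i=1}^{|\mathcal{X}| - 1} \tilde{q}(i) = 1 - \sum_{i=1}^{|\mathcal{X}| - 1} \left\lfloor \frac{p(i)}{\gamma} \right\rfloor \gamma \ge 1 - \sum_{i=1}^{|\mathcal{X}| - 1} p(i) \ge 0. \qquad (47)$$

Therefore $q \in \mathcal{O}$ and furthermore,

$$\|p - q\|_1 = \sum_{i=1}^{|\mathcal{X}| - 1} \left( p(i) - \left\lfloor \frac{p(i)}{\gamma} \right\rfloor \gamma \right) + \left| \sum_{i=1}^{|\mathcal{X}| - 1} \left( p(i) - \left\lfloor \frac{p(i)}{\gamma} \right\rfloor \gamma \right) \right| \qquad (48)$$
$$\le 2(|\mathcal{X}| - 1)\gamma \qquad (49)$$
$$\le 2|\mathcal{X}|\gamma \qquad (50)$$
$$\le g\left(\frac{\epsilon \tilde{d}_0}{11 d^*}\right). \qquad (51)$$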



*This is clearly not the best bound as many of the points in the unit cube do not yield distributions in $\mathcal{P}(\mathcal{X})$. The factor by which we
are overbounding is roughly $|\mathcal{X}|!$, but this factor does not affect the dependence on $\epsilon$.
By Lemma 7.3, $R(q, D) \ge R(p, D) - \epsilon$. This distribution $q$ (or possibly one closer to $p$) will always be included
in the maximization yielding $R_a(D)$, so we have $R_a(D) \ge \max_{p \in \mathcal{Q}} R(p, D) - \epsilon = R(D) - \epsilon$.
Conversely, for a $q \in \mathcal{O}$, if $\min_{p \in \mathcal{Q}} \|p - q\|_1 \le 2|\mathcal{X}|\gamma$, Lemma 7.3 again gives

$$R(q, D) \le \max_{p \in \mathcal{Q}} R(p, D) + \epsilon = R(D) + \epsilon. \qquad (52)$$

Therefore, $|R_a(D) - R(D)| \le \epsilon$. ■



C. Estimation of the rate-distortion function of an unknown IID source 

An explicit bound on the continuity of the rate-distortion function has other applications. Recently, Harrison and
Kontoyiannis [23] have studied the problem of estimating the rate-distortion function of the marginal distribution
of an unknown source. Let $p_{x^n}$ be the (marginal) empirical distribution of a vector $x^n \in \mathcal{X}^n$. They show that the
'plug-in' estimator $R(p_{x^n}, D)$, the rate-distortion function of the empirical marginal distribution of a sequence, is
a consistent estimator for a large class of sources beyond just IID sources with known alphabets. However, if the
source is known to be IID with alphabet size $|\mathcal{X}|$, estimates of the convergence rate (in probability) of the estimator
can be provided using the uniform continuity of the rate-distortion function.

Suppose the true source is IID with distribution $p \in \mathcal{P}(\mathcal{X})$ and fix a probability $\tau \in (0, 1)$ and an $\epsilon \in (0, \ln |\mathcal{X}|)$.
We wish to answer the question: How many samples $n$ need to be taken so that $|R(p_{x^n}, D) - R(p, D)| \le \epsilon$ with
probability at least $1 - \tau$? The following theorem gives a sufficient number of samples $n$.

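Theorem 7.2: Let $d : \mathcal{X} \times \hat{\mathcal{X}} \to [0, d^*]$ be a distortion measure for which Lemma 7.2 holds. For any $p \in \mathcal{P}(\mathcal{X})$,
$\tau \in (0, 1)$, and $\epsilon \in (0, \ln |\mathcal{X}|)$,

$$P\left( |R(p_{x^n}, D) - R(p, D)| \ge \epsilon \right) \le \tau \qquad (53)$$

if

$$n \ge \frac{2}{g\left(\frac{\epsilon \tilde{d}}{7 d^*}\right)^2} \left( \ln \frac{1}{\tau} + |\mathcal{X}| \ln 2 \right). \qquad (54)$$

Proof: From Lemma 7.2, we have

$$P\left( |R(p_{x^n}, D) - R(p, D)| \ge \epsilon \right) \le P\left( \|p_{x^n} - p\|_1 \ge g\left(\frac{\epsilon \tilde{d}}{7 d^*}\right) \right) \qquad (55)$$
$$\le 2^{|\mathcal{X}|} \exp\left( -\frac{n}{2}\, g\left(\frac{\epsilon \tilde{d}}{7 d^*}\right)^2 \right). \qquad (56)$$

The last line follows from Theorem 2.1 of [24]. This bound is similar to, but a slight improvement over, the
method-of-types bound of Sanov's Theorem. Rather than an $(n + 1)^{|\mathcal{X}|}$ term, we just have a $2^{|\mathcal{X}|}$ term multiplying
the exponential. Taking ln of both sides gives the desired result. ■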
We emphasize that this number $n$ is a sufficient number of samples regardless of what the true distribution
$p \in \mathcal{P}(\mathcal{X})$ is. The bound of (54) depends only on the distortion measure $d$, alphabet sizes $|\mathcal{X}|$ and $|\hat{\mathcal{X}}|$, desired
accuracy $\epsilon$ and 'estimation error' probability $\tau$.
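To get a feel for the size of this sufficient sample count, the sketch below evaluates it numerically. It assumes that (54) reads $n \ge 2\, g(\epsilon \tilde{d}/(7d^*))^{-2} (\ln(1/\tau) + |\mathcal{X}| \ln 2)$, which is the form obtained by setting a bound of the type $2^{|\mathcal{X}|} \exp(-n g^2/2)$ equal to $\tau$; the helper names and the bisection tolerance are illustrative choices.

```python
import math

def f(gamma, c):
    """f(gamma) = gamma * ln(c / gamma), with c = |X||X_hat| and f(0) = 0."""
    return 0.0 if gamma == 0.0 else gamma * math.log(c / gamma)

def g(y, c, tol=1e-14):
    """Inverse of f on [0, 1/2] by bisection (f is increasing there)."""
    lo, hi = 0.0, 0.5
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid, c) < y:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def samples_needed(eps, tau, nx, nxhat, d_tilde, d_star):
    """Sufficient n under the assumed reading of (54); eps is in nats."""
    radius = g(eps * d_tilde / (7.0 * d_star), nx * nxhat)   # required l1 accuracy of the type
    return math.ceil(2.0 / radius ** 2 * (math.log(1.0 / tau) + nx * math.log(2.0)))

# Binary alphabets with Hamming distortion: d_tilde = d_star = 1.
n = samples_needed(eps=0.1, tau=0.05, nx=2, nxhat=2, d_tilde=1.0, d_star=1.0)
```

The count grows quickly as `eps` shrinks because `g` is roughly linear in its argument up to a logarithmic factor, so the bound scales roughly like $1/\epsilon^2$ times logarithmic terms.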

VIII. Concluding Remarks 

As mentioned in the introduction, the active-source problem is truly interesting when the sources have memory.
Dobrushin [25] has analyzed the case of the non-anticipatory AVS composed of independent sources with memory
with different distributions when the switcher is passive and blindly chooses the switch position. In the case of
sources with memory, additional knowledge will no doubt increase the adversary's power to increase the
rate-distortion function. If we let $R^{(k)}(D)$ be the rate-distortion function for an AVS composed of sources with memory
and an adversary with $k$-step lookahead, one could imagine that in general,

$$R^{(0)}(D) \le R^{(1)}(D) \le R^{(2)}(D) \le \cdots \le R^{(\infty)}(D). \qquad (57)$$

Another interesting problem, at least mathematically, is the arbitrarily varying channel formulation analogous to
the problems of Sections III and IV. Similar techniques to those developed here might prove useful in considering
a cheating 'jammer' for an arbitrarily varying channel. While the problem is well defined, it seems unphysical in
the usual context of jamming or channel noise. The idea may make more sense in the context of watermarking,
where the adversary can try many different attacks on different letters of the input before deciding to choose one
for each.

For the original motivation of compressing active-vision sources, the results here suggest that treating it as an
adversarial black box might be overly conservative. There is a large gap between the adversarial and helpful
rate-distortion functions. This suggests that an interesting question to study is one of mismatched objectives where the
switcher is trying to be helpful for some particular distortion metric but the source is actually being encoded with a
different metric in mind. Finally, if the active-sensor and coding system are part of a tightly delay-constrained control
loop, we would want to study these issues from the causal source code perspective of [13]. It seems likely that
the adversarial results of Theorems 3.1 and 4.1 would follow straightforwardly with the same sets of distributions
$\mathcal{C}$ and $\mathcal{D}$, with the IID rate-distortion function for noncausal source codes replaced by the IID rate-distortion
functions for causal source codes.

Appendix I
Proof of Theorem 3.1

A. Achievability for the coder
The main tool of the proof is:

Lemma 1.1 (Type Covering): Let $S_D(\hat{x}^n) = \{x^n \in \mathcal{X}^n : d_n(x^n, \hat{x}^n) \le D\}$ be the set of $\mathcal{X}^n$ strings that are
within distortion $D$ of an $\hat{\mathcal{X}}^n$ string $\hat{x}^n$. Fix a $p \in \mathcal{P}_n(\mathcal{X})$ and an $\epsilon > 0$. Then for all $n$ large enough, there exist
codebooks $\mathcal{B} = \{\hat{x}^n(1), \hat{x}^n(2), \ldots, \hat{x}^n(M)\}$ where $M < \exp(n(R(p, D) + \epsilon))$ and

$$T_p^n \subseteq \bigcup_{\hat{x}^n \in \mathcal{B}} S_D(\hat{x}^n), \qquad (58)$$

where $T_p^n$ is the set of $\mathcal{X}^n$ strings with type $p$.

Proof: See [5], Lemma 1. ■

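We now show how the coder can get arbitrarily close to $\tilde{R}(D) = \max_{p \in \mathcal{C}} R(p, D)$ for large enough $n$. For $\delta > 0$, define $\mathcal{C}_\delta$ as

$$\mathcal{C}_\delta = \left\{ p \in \mathcal{P}(\mathcal{X}) : \sum_{x \in \mathcal{V}} p(x) \ge P(x_l \in \mathcal{V}, 1 \le l \le m) - \delta \ \ \forall\, \mathcal{V} \text{ such that } \mathcal{V} \subseteq \mathcal{X} \right\}. \qquad (59)$$

Lemma 1.2 (Converse for switcher): Let $\epsilon > 0$. For all $n$ sufficiently large,

$$\frac{1}{n} \ln M(n, D) \le \tilde{R}(D) + \epsilon. \qquad (60)$$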

Proof: We know $R(p, D)$ is a continuous function of $p$ ([19]). It follows then that because $\mathcal{C}_\delta$ is monotonically
decreasing (as a set) with $\delta$ that for all $\epsilon > 0$, there is a $\delta > 0$ so that

$$\max_{p \in \mathcal{C}_\delta} R(p, D) \le \max_{p \in \mathcal{C}} R(p, D) + \epsilon/2. \qquad (61)$$

We will have the coder use a codebook such that all $\mathcal{X}^n$ strings with types in $\mathcal{C}_\delta$ are covered within distortion
$D$. The coder can do this for large $n$ with at most $M$ codewords in the codebook $\mathcal{B}$, where

$$M \le (n + 1)^{|\mathcal{X}|} \exp\left(n \max_{p \in \mathcal{C}_\delta} R(p, D)\right) \qquad (62)$$
$$\le \exp\left(n \left(\max_{p \in \mathcal{C}} R(p, D) + \epsilon\right)\right). \qquad (63)$$

Explicitly, this is done by taking a union of the codebooks provided by the type-covering lemma and noting that
the number of types in $\mathcal{P}_n(\mathcal{X})$ is less than $(n + 1)^{|\mathcal{X}|}$. Next, we will show that the probability of the switcher being
able to produce a string with a type not in $\mathcal{C}_\delta$ goes to 0 exponentially with $n$.

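Consider a type $p \in \mathcal{P}_n(\mathcal{X}) \cap (\mathcal{P}(\mathcal{X}) - \mathcal{C}_\delta)$. By definition, there is some $\mathcal{V} \subseteq \mathcal{X}$ such that $\sum_{x \in \mathcal{V}} p(x) < P(x_l \in
\mathcal{V}, 1 \le l \le m) - \delta$. Let $\zeta_k(\mathcal{V})$ be the indicator function

$$\zeta_k(\mathcal{V}) = \prod_{l=1}^{m} 1(x_{l,k} \in \mathcal{V}). \qquad (64)$$

$\zeta_k$ indicates the event that the switcher cannot output a symbol outside of $\mathcal{V}$ at time $k$. Then $\zeta_k(\mathcal{V})$ is a Bernoulli
random variable with a probability of being 1 equal to $\kappa(\mathcal{V}) = P(x_l \in \mathcal{V}, 1 \le l \le m)$. Since the subsources are
IID over time, $\zeta_k(\mathcal{V})$ is a sequence of IID binary random variables with distribution $q' = (1 - \kappa(\mathcal{V}), \kappa(\mathcal{V}))$.

Now for the type $p \in \mathcal{P}_n(\mathcal{X}) \cap (\mathcal{P}(\mathcal{X}) - \mathcal{C}_\delta)$, we have that for all strings in the type class $T_p$, $\frac{1}{n} \sum_{i=1}^{n} 1(x_i \in
\mathcal{V}) < \kappa(\mathcal{V}) - \delta$. Let $p'$ be the binary distribution $(1 - \kappa(\mathcal{V}) + \delta, \kappa(\mathcal{V}) - \delta)$. Therefore $\|p' - q'\|_1 = 2\delta$, and hence
we can bound the binary divergence $D(p' \| q') \ge 2\delta^2$ by Pinsker's inequality. Using standard types properties [20]
gives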

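$$P\left( \frac{1}{n} \sum_{k=1}^{n} \zeta_k(\mathcal{V}) < \kappa(\mathcal{V}) - \delta \right) \le (n + 1) \exp(-n D(p' \| q')) \qquad (65)$$
$$\le (n + 1) \exp(-2n\delta^2). \qquad (66)$$

This bound holds for all $\mathcal{V} \subseteq \mathcal{X}$, $\mathcal{V} \neq \emptyset$, so we sum over types not in $\mathcal{C}_\delta$ to get

$$P(p_{x^n} \notin \mathcal{C}_\delta) \le \sum_{p \in \mathcal{P}_n(\mathcal{X}) \cap (\mathcal{P}(\mathcal{X}) - \mathcal{C}_\delta)} (n + 1) \exp(-2n\delta^2) \qquad (67)$$
$$\le (n + 1)^{|\mathcal{X}|} \exp(-2n\delta^2) \qquad (68)$$
$$= \exp\left( -n \left( 2\delta^2 - |\mathcal{X}| \frac{\ln(n + 1)}{n} \right) \right). \qquad (69)$$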


Then, regardless of the switcher strategy,

$$E[d(x^n; \mathcal{B})] \le D + d^* \cdot \exp\left( -n \left( 2\delta^2 - |\mathcal{X}| \frac{\ln(n + 1)}{n} \right) \right). \qquad (70)$$

So for large $n$ we can get arbitrarily close to distortion $D$ while the rate is at most $\tilde{R}(D) + \epsilon$. Using the fact
that the IID rate-distortion function is continuous in $D$ gives us that the coder can achieve at most distortion $D$ on
average while the asymptotic rate is at most $\tilde{R}(D) + \epsilon$. Since $\epsilon$ is arbitrary, $R(D) \le \tilde{R}(D) = \max_{p \in \mathcal{C}} R(p, D)$. ■

B. Achievability for the switcher 

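This section shows that $R(D) \ge \tilde{R}(D) = \max_{p \in \mathcal{C}} R(p, D)$ when the switcher has 1-step lookahead. We will show that the switcher
can target any distribution $p \in \mathcal{C}$ and produce a sequence of IID symbols with distribution $p$. In particular, the
switcher can target the distribution that yields $\max_{p \in \mathcal{C}} R(p, D)$, so $R(D) \ge \tilde{R}(D)$.

The switcher will use a memoryless randomized strategy. Let $\mathcal{V} \subseteq \mathcal{X}$ and suppose that at some time $k$ the
set of symbols available to choose from for the switcher is exactly $\mathcal{V}$, i.e. $\{x_{1,k}, \ldots, x_{m,k}\} = \mathcal{V}$. Recall $\beta(\mathcal{V}) =
P(\{x_{1,1}, \ldots, x_{m,1}\} = \mathcal{V})$ is the probability that at any time the switcher must choose among elements of $\mathcal{V}$ and
no other symbols. Then let $f(x|\mathcal{V})$ be a probability distribution on $\mathcal{X}$ with support $\mathcal{V}$, i.e. $f(x|\mathcal{V}) \ge 0$ for all $x \in \mathcal{X}$,
$f(x|\mathcal{V}) = 0$ if $x \notin \mathcal{V}$, and $\sum_{x \in \mathcal{V}} f(x|\mathcal{V}) = 1$. The switcher will have such a randomized rule for every nonempty
subset $\mathcal{V}$ of $\mathcal{X}$ such that $|\mathcal{V}| \le m$. Let $\mathcal{D}$ be the set of distributions on $\mathcal{X}$ that can be achieved with these kinds of
rules,

$$\mathcal{D} = \left\{ p \in \mathcal{P}(\mathcal{X}) : p(\cdot) = \sum_{\mathcal{V} \subseteq \mathcal{X}, |\mathcal{V}| \le m} \beta(\mathcal{V}) f(\cdot|\mathcal{V}),\ \ \forall\, \mathcal{V} \text{ s.t. } \mathcal{V} \subseteq \mathcal{X}, |\mathcal{V}| \le m,\ f(\cdot|\mathcal{V}) \text{ is a PMF on } \mathcal{V} \right\}. \qquad (71)$$

It is clear by construction that $\mathcal{D} \subseteq \mathcal{C}$ because the conditions in $\mathcal{C}$ are those that only prevent the switcher from
producing symbols that do not occur enough on average, but put no further restrictions on the switcher. So we need
only show that $\mathcal{C} \subseteq \mathcal{D}$. The following gives such a proof by contradiction.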

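Lemma 1.3 (Achievability for switcher): The set relation $\mathcal{C} \subseteq \mathcal{D}$ is true.

Proof: Without loss of generality, let $\mathcal{X} = \{1, \ldots, |\mathcal{X}|\}$. Suppose $p \in \mathcal{C}$ but $p \notin \mathcal{D}$. It is clear that $\mathcal{D}$ is a
convex set. Let us view the probability simplex in $\mathbb{R}^{|\mathcal{X}|}$. Since $\mathcal{D}$ is a convex set, there is a hyperplane through
$p$ that does not intersect $\mathcal{D}$. Hence, there is a vector $(a_1, \ldots, a_{|\mathcal{X}|})$ such that $\sum_{i=1}^{|\mathcal{X}|} a_i p(i) = t$ for some real $t$ but
$t < \min_{q \in \mathcal{D}} \sum_{i=1}^{|\mathcal{X}|} a_i q(i)$. Without loss of generality, assume $a_1 \ge a_2 \ge \cdots \ge a_{|\mathcal{X}|}$ (otherwise permute symbols).
Now, we will construct $f(\cdot|\mathcal{V})$ so that the resulting $q$ has $\sum_{i=1}^{|\mathcal{X}|} a_i p(i) \ge \sum_{i=1}^{|\mathcal{X}|} a_i q(i)$, which contradicts the initial
assumption. Let

$$f(i|\mathcal{V}) = \begin{cases} 1 & \text{if } i = \max(\mathcal{V}) \\ 0 & \text{else,} \end{cases} \qquad (72)$$

so for example, if $\mathcal{V} = \{1, 5, 6, 9\}$, then $f(9|\mathcal{V}) = 1$ and $f(i|\mathcal{V}) = 0$ if $i \neq 9$. Call $q$ the distribution on $\mathcal{X}$ induced
by this choice of $f(\cdot|\mathcal{V})$. Recall that $\kappa(\mathcal{V}) = P(x_l \in \mathcal{V}, 1 \le l \le m)$. Then, we have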

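$$\sum_{i=1}^{|\mathcal{X}|} a_i q(i) = a_1 \kappa(\{1\}) + a_2 [\kappa(\{1, 2\}) - \kappa(\{1\})] + \cdots + a_{|\mathcal{X}|} [\kappa(\{1, \ldots, |\mathcal{X}|\}) - \kappa(\{1, \ldots, |\mathcal{X}| - 1\})]. \qquad (73)$$

By the constraints in the definition (19) of $\mathcal{C}$, we have the following inequalities for $p$:

$$p(1) \ge \kappa(\{1\}) = q(1) \qquad (74)$$
$$p(1) + p(2) \ge \kappa(\{1, 2\}) = q(1) + q(2) \qquad (75)$$
$$\vdots$$
$$\sum_{i=1}^{|\mathcal{X}| - 1} p(i) \ge \kappa(\{1, \ldots, |\mathcal{X}| - 1\}) = \sum_{i=1}^{|\mathcal{X}| - 1} q(i).$$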

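Therefore, the difference of the objective is

$$\sum_{i=1}^{|\mathcal{X}|} a_i (p(i) - q(i)) = a_{|\mathcal{X}|} \left[ \sum_{i=1}^{|\mathcal{X}|} p(i) - \sum_{i=1}^{|\mathcal{X}|} q(i) \right] + (a_{|\mathcal{X}| - 1} - a_{|\mathcal{X}|}) \left[ \sum_{i=1}^{|\mathcal{X}| - 1} p(i) - q(i) \right] \qquad (76)$$
$$+ \cdots + (a_1 - a_2) \left[ p(1) - q(1) \right] \qquad (77)$$
$$= \sum_{i=1}^{|\mathcal{X}| - 1} [a_i - a_{i+1}] \left[ \sum_{j=1}^{i} p(j) - q(j) \right] \qquad (78)$$
$$\ge 0. \qquad (79)$$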



The last step is true because of the monotonicity in the $a_i$ and the inequalities we derived earlier. Therefore, we
see that $\sum_{i=1}^{|\mathcal{X}|} a_i p(i) \ge \sum_{i=1}^{|\mathcal{X}|} a_i q(i)$ for the $p$ we had chosen at the beginning of the proof. This contradicts the
assumption that $\sum_{i=1}^{|\mathcal{X}|} a_i p(i) < \min_{q \in \mathcal{D}} \sum_{i=1}^{|\mathcal{X}|} a_i q(i)$, therefore it must be that $\mathcal{C} \subseteq \mathcal{D}$. ■
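The key identity behind (73) is that the greedy rule (72) induces a distribution $q$ whose partial sums equal $\kappa(\{1, \ldots, j\})$. The sketch below checks this with exact arithmetic, assuming independent subsources given as symbol-to-probability dictionaries over $\{1, \ldots, |\mathcal{X}|\}$; the function names and the example subsources are hypothetical.

```python
from fractions import Fraction
from itertools import product

def greedy_output_pmf(subsources):
    """pmf q induced by rule (72): always emit max(V), the largest available symbol.

    subsources is a list of dicts mapping symbol -> probability,
    one dict per independent subsource.
    """
    q = {}
    for combo in product(*(s.items() for s in subsources)):
        prob = Fraction(1)
        for _, p in combo:
            prob *= p                      # joint probability of this realization
        pick = max(sym for sym, _ in combo)
        q[pick] = q.get(pick, Fraction(0)) + prob
    return q

def kappa(subsources, V):
    """kappa(V) = P(x_l in V for every subsource l)."""
    out = Fraction(1)
    for s in subsources:
        out *= sum(p for sym, p in s.items() if sym in V)
    return out

# Two hypothetical subsources on the alphabet {1, 2, 3}.
s1 = {1: Fraction(1, 2), 2: Fraction(1, 4), 3: Fraction(1, 4)}
s2 = {1: Fraction(1, 3), 2: Fraction(1, 3), 3: Fraction(1, 3)}
q = greedy_output_pmf([s1, s2])
partial_sums = [sum(q.get(i, Fraction(0)) for i in range(1, j + 1)) for j in (1, 2, 3)]
```

Each entry of `partial_sums` equals `kappa([s1, s2], {1, ..., j})`, because the largest available symbol is at most $j$ exactly when every subsource outputs a symbol in $\{1, \ldots, j\}$.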



Appendix II
Proof of Theorem 4.1

It is clear that $R(D) \ge \max_{p \in \mathcal{D}_{states}} R(p, D)$ because the switcher can select distributions $f(\cdot|t) \in \mathcal{G}(t)$ for
all $t \in \mathcal{T}$ and upon observing a state $t$, the switcher can randomly select the switch position according to the
convex combination that yields $f(\cdot|t)$. With this strategy, the AVS is simply an IID source with distribution $p(\cdot) =
\sum_t \alpha(t) f(\cdot|t)$. Hence, $R(D) \ge \max_{p \in \mathcal{D}_{states}} R(p, D)$.

We will now show that $R(D) \le \max_{p \in \mathcal{D}_{states}} R(p, D)$. This can be done in the same way as in Appendix I. We
can use the type covering lemma to cover sequences with types in or very near $\mathcal{D}_{states}$ and then we need only
show that the probability of $x^n$ having a type $\epsilon$ far from $\mathcal{D}_{states}$ goes to 0 with block length $n$.

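Lemma 2.1: Let $p_{x^n}$ be the type of $x^n$ and for $\epsilon > 0$ let $\mathcal{D}_{states,\epsilon}$ be the set of $p \in \mathcal{P}(\mathcal{X})$ with $\ell_1$ distance at
most $\epsilon$ from a distribution in $\mathcal{D}_{states}$. Then, for $\epsilon > 0$,

$$P(p_{x^n} \notin \mathcal{D}_{states,\epsilon}) \le 4|\mathcal{T}||\mathcal{X}| \exp(-n \xi(\epsilon)), \qquad (80)$$

where $\xi(\epsilon) > 0$ for all $\epsilon > 0$. So for large $n$, $p_{x^n}$ is in $\mathcal{D}_{states,\epsilon}$ with high probability.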
Proof: Let $t^n$ be the $n$-length vector of the observed states. We assume that the switcher has advance knowledge
of all these states before choosing the switch positions. First, we show that with high probability, the states that
are observed are strongly typical. Let $N(t|t^n)$ be the count of occurrence of $t \in \mathcal{T}$ in the vector $t^n$. Fix a $\delta > 0$
and for $t \in \mathcal{T}$, define the event

$$A_\delta^t = \left\{ \left| \frac{N(t|t^n)}{n} - \alpha(t) \right| > \delta \right\}. \qquad (81)$$

Since $N(t|t^n) = \sum_{i=1}^{n} 1(t_i = t)$ and each term in the sum is an IID Bernoulli variable with probability of 1 equal
to $\alpha(t)$, we have by Hoeffding's tail inequality [26],

$$P(A_\delta^t) \le 2 \exp(-2n\delta^2). \qquad (82)$$

Next, we need to show that the substrings output by the AVS at the times when the state is $t$ have a type in
or very near $\mathcal{G}(t)$. This will be done by a martingale argument similar to that given in Lemma 3 of [5]. Let $t^\infty$
denote the infinite state sequence $(t_1, t_2, \ldots)$ and let $\mathcal{F}_0 = \sigma(t^\infty)$ be the sigma field generated by the states $t^\infty$.
For $i = 1, 2, \ldots$, let $\mathcal{F}_i = \sigma(t^\infty, s^i, x_1^i, \ldots, x_m^i)$. Note that $\{\mathcal{F}_i\}_{i=0}^{\infty}$ is a filtration and for each $i$, the $x_i$ is included
in $\mathcal{F}_i$ trivially because $x_i = x_{s_i, i}$.

Let $C_i$ be the $|\mathcal{X}|$-dimensional unit vector with a 1 in the position of $x_i$. That is, $C_i(x) = 1(x_i = x)$ for each
$x \in \mathcal{X}$. Define $T_i$ to be

$$T_i = C_i - E[C_i | \mathcal{F}_{i-1}] \qquad (83)$$

and let $S_0 = 0$. For $k \ge 1$,

$$S_k = \sum_{i=1}^{k} T_i. \qquad (84)$$

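We claim that $S_k$, $k \ge 1$ is a martingale† with respect to the filtration $\{\mathcal{F}_i\}$ defined previously. To see this, note
that $E[|S_k|] < \infty$ for all $k$ since $S_k$ is bounded (not uniformly). Also, $S_k \in \mathcal{F}_k$ because $T_i \in \mathcal{F}_i$ for each $i$. Finally,

$$E[S_{k+1} | \mathcal{F}_k] = E[T_{k+1} + S_k | \mathcal{F}_k]$$
$$= E[T_{k+1} | \mathcal{F}_k] + S_k$$
$$= E[C_{k+1} - E[C_{k+1} | \mathcal{F}_k] \,|\, \mathcal{F}_k] + S_k$$
$$= E[C_{k+1} | \mathcal{F}_k] - E[C_{k+1} | \mathcal{F}_k] + S_k$$
$$= S_k.$$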



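Now, define for each $t \in \mathcal{T}$,

$$T_i^t = T_i \cdot 1(t_i = t) \qquad (85)$$

and analogously,

$$S_k^t = \sum_{i=1}^{k} T_i^t. \qquad (86)$$

It can be easily verified that $S_k^t$ is a martingale with respect to $\mathcal{F}_k$ for each $t \in \mathcal{T}$. Expanding, we also see that

$$\frac{1}{N(t|t^n)} S_n^t = \frac{1}{N(t|t^n)} \sum_{i=1}^{n} T_i 1(t_i = t) \qquad (87)$$
$$= \frac{1}{N(t|t^n)} \sum_{i : t_i = t} C_i - \frac{1}{N(t|t^n)} \sum_{i : t_i = t} E[C_i | \mathcal{F}_{i-1}]. \qquad (88)$$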

The first term in the difference above is the type of the output of the AVS during times when the state is $t$. For any $i$ such that $t_i = t$,

$$E[C_i | \mathcal{F}_{i-1}] = \sum_{l=1}^{m} P(l | \mathcal{F}_{i-1}) \, p_l(\cdot | t) \in \mathcal{G}(t). \tag{89}$$



† $S_k$ is a vector, so we show that each component of the vector is a martingale. For ease of notation, we drop the dependence on the component of the vector until it is explicitly needed.
In the above, $P(l | \mathcal{F}_{i-1})$ represents the switcher's possibly random strategy because the switcher chooses the switch position at time $i$ with knowledge of events in $\mathcal{F}_{i-1}$. The source generator's outputs, conditioned on the state at the time, are independent of all other random variables, so $\sum_{l=1}^{m} P(l | \mathcal{F}_{i-1}) \, p_l(\cdot | t)$ is the probability distribution of the output at time $i$ conditioned on $\mathcal{F}_{i-1}$.

Thus, the second term in the difference of equation (88) is in $\mathcal{G}(t)$ because it is the average of $N(t|t^n)$ terms in $\mathcal{G}(t)$ and $\mathcal{G}(t)$ is a convex set. Therefore, $S_n^t / N(t|t^n)$ measures the difference between the type of symbols output at times when the state is $t$ and some distribution guaranteed to be in $\mathcal{G}(t)$.

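Let $p_{x^n}$ be the empirical type of the string $x^n$, and let $p^t_{x^n}$ be the empirical type of the sub-string of $x^n$ corresponding to the times $i$ when $t_i = t$. Then,

$$p_{x^n} = \sum_{t \in \mathcal{T}} \frac{N(t|t^n)}{n} \, p^t_{x^n}. \tag{90}$$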
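The decomposition in (90) can be checked directly on a small example. The sketch below is only an illustration: the helper names, the alphabet, and the sample strings are assumptions made for the example.

```python
from collections import Counter

def empirical_type(symbols, alphabet):
    """Empirical distribution (type) of a non-empty list of symbols."""
    counts = Counter(symbols)
    return {x: counts[x] / len(symbols) for x in alphabet}

def mixture_of_state_types(xs, ts, alphabet):
    """Right-hand side of (90): sum over t of (N(t|t^n)/n) * p^t_{x^n}."""
    n = len(xs)
    mix = {x: 0.0 for x in alphabet}
    for t in set(ts):
        sub = [x for x, s in zip(xs, ts) if s == t]  # outputs at times with state t
        p_t = empirical_type(sub, alphabet)
        for x in alphabet:
            mix[x] += (len(sub) / n) * p_t[x]
    return mix

# Hypothetical output string x^n and state string t^n.
alphabet = [0, 1, 2]
xs = [0, 1, 1, 0, 2, 1, 0, 2]
ts = ["a", "b", "a", "a", "b", "b", "a", "b"]
overall = empirical_type(xs, alphabet)
mixed = mixture_of_state_types(xs, ts, alphabet)
```

Both dictionaries agree up to floating-point rounding, which is all that (90) asserts.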



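Let $\mathcal{G}(t)_\epsilon$ be the set of distributions at most $\epsilon$ in $\ell_1$ distance from a distribution in $\mathcal{G}(t)$. Recall that for $|\mathcal{X}|$-dimensional vectors, $\|p - q\|_\infty < \epsilon/|\mathcal{X}|$ implies $\|p - q\|_1 < \epsilon$. Hence, we have

\begin{align}
P\left( \bigcup_{t \in \mathcal{T}} \left\{ p^t_{x^n} \notin \mathcal{G}(t)_\epsilon \right\} \right) &\le P\left( \bigcup_{t \in \mathcal{T}} \bigcup_{x \in \mathcal{X}} \left\{ \left| \frac{1}{N(t|t^n)} S_n^t(x) \right| > \frac{\epsilon}{|\mathcal{X}|} \right\} \right) \tag{91} \\
&\le \sum_{t \in \mathcal{T}} \sum_{x \in \mathcal{X}} P\left( \left| \frac{1}{N(t|t^n)} S_n^t(x) \right| > \frac{\epsilon}{|\mathcal{X}|} \right). \tag{92}
\end{align}

Let $(A_\delta^t)^c$ denote the complement of the event $A_\delta^t$. So, for every $(t, x)$ we have

\begin{align}
P\left( \left| \frac{1}{N(t|t^n)} S_n^t(x) \right| > \frac{\epsilon}{|\mathcal{X}|} \right) &= P\left( \left| \frac{1}{N(t|t^n)} S_n^t(x) \right| > \frac{\epsilon}{|\mathcal{X}|}, \, A_\delta^t \right) + P\left( \left| \frac{1}{N(t|t^n)} S_n^t(x) \right| > \frac{\epsilon}{|\mathcal{X}|}, \, (A_\delta^t)^c \right) \tag{93} \\
&\le 2\exp(-2n\delta^2) + P\left( \left| \frac{1}{N(t|t^n)} S_n^t(x) \right| > \frac{\epsilon}{|\mathcal{X}|}, \, (A_\delta^t)^c \right). \tag{94}
\end{align}

In the event of $(A_\delta^t)^c$, we have $N(t|t^n) \ge n(\alpha(t) - \delta)$, so

\begin{align}
P\left( \left| \frac{1}{N(t|t^n)} S_n^t(x) \right| > \frac{\epsilon}{|\mathcal{X}|}, \, (A_\delta^t)^c \right) &\le P\left( |S_n^t(x)| > n(\alpha(t) - \delta) \frac{\epsilon}{|\mathcal{X}|}, \, (A_\delta^t)^c \right) \tag{95} \\
&\le P\left( |S_n^t(x)| > n(\alpha(t) - \delta) \frac{\epsilon}{|\mathcal{X}|} \right). \tag{96}
\end{align}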



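$S_k^t(x)$ is a martingale with bounded differences since $|S_{k+1}^t(x) - S_k^t(x)| = |T_{k+1}^t(x)| \le 1$. Hence, we can apply Azuma's inequality [27] to get

$$P\left( |S_n^t(x)| > n(\alpha(t) - \delta) \frac{\epsilon}{|\mathcal{X}|} \right) \le 2\exp\left( -n \frac{(\alpha(t) - \delta)^2 \epsilon^2}{2|\mathcal{X}|^2} \right). \tag{97}$$

Plugging this back into the preceding bounds,

\begin{align}
P\left( \bigcup_{t \in \mathcal{T}} \left\{ p^t_{x^n} \notin \mathcal{G}(t)_\epsilon \right\} \right) &\le 2|\mathcal{T}||\mathcal{X}| \left( \exp(-2n\delta^2) + \exp\left( -n \frac{(\alpha_* - \delta)^2 \epsilon^2}{2|\mathcal{X}|^2} \right) \right) \tag{98} \\
&\le 4|\mathcal{X}||\mathcal{T}| \exp(-n\zeta(\epsilon, \delta)), \tag{99}
\end{align}

where

\begin{align}
\zeta(\epsilon, \delta) &= \min\left\{ 2\delta^2, \ \frac{\epsilon^2 (\alpha_* - \delta)^2}{2|\mathcal{X}|^2} \right\}, \tag{100} \\
\alpha_* &= \min_{t \in \mathcal{T}} \alpha(t). \tag{101}
\end{align}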


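We assume without loss of generality that $\alpha_* > 0$ since $\mathcal{T}$ is finite. We will soon need that $\delta \le \epsilon/|\mathcal{T}|$, so let

$$\tilde{\xi}(\epsilon) = \max_{0 < \delta < \min\{\epsilon/|\mathcal{T}|, \, \alpha_*\}} \zeta(\epsilon, \delta) \tag{102}$$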
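and note that it is always positive provided $\epsilon > 0$, since $\zeta(\epsilon, \delta) > 0$ whenever $\delta \in (0, \alpha_*)$. Hence,

$$P\left( \bigcup_{t \in \mathcal{T}} \left\{ p^t_{x^n} \notin \mathcal{G}(t)_\epsilon \right\} \right) \le 4|\mathcal{X}||\mathcal{T}| \exp(-n\tilde{\xi}(\epsilon)). \tag{103}$$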



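We have shown that with probability at least $1 - 4|\mathcal{X}||\mathcal{T}| \exp(-n\tilde{\xi}(\epsilon))$, for each $t \in \mathcal{T}$ there is some $p^t \in \mathcal{G}(t)$ such that $\|p^t_{x^n} - p^t\|_1 \le \epsilon$ and $(A_\delta^t)^c$ occurs. Let

$$p = \sum_{t \in \mathcal{T}} \alpha(t) \, p^t. \tag{104}$$

By construction, $p \in \mathcal{D}_{states}$. To finish, we show that $\|p_{x^n} - p\|_1 \le 2\epsilon$.

\begin{align}
\|p_{x^n} - p\|_1 &= \sum_{x} |p_{x^n}(x) - p(x)| \tag{105} \\
&= \sum_{x} \left| \sum_{t} \left( \frac{N(t|t^n)}{n} p^t_{x^n}(x) - \alpha(t) p^t(x) \right) \right| \tag{106} \\
&\le \sum_{t} \sum_{x} \left| \frac{N(t|t^n)}{n} p^t_{x^n}(x) - \alpha(t) p^t(x) \right| \tag{107} \\
&= \sum_{t} \alpha(t) \sum_{x} \left| \frac{N(t|t^n)}{n\alpha(t)} p^t_{x^n}(x) - p^t(x) \right| \tag{108} \\
&\le \sum_{t} \alpha(t) \sum_{x} \left( \left| p^t_{x^n}(x) - p^t(x) \right| + \left| \frac{N(t|t^n)}{n\alpha(t)} - 1 \right| p^t_{x^n}(x) \right). \tag{109}
\end{align}

From (81), we are assumed to be in the event that

$$\left| \frac{N(t|t^n)}{n\alpha(t)} - 1 \right| \le \frac{\delta}{\alpha(t)}. \tag{110}$$

Hence,

\begin{align}
\|p_{x^n} - p\|_1 &\le \sum_{t} \alpha(t) \left( \epsilon + \frac{\delta}{\alpha(t)} \right) \tag{111} \\
&= \epsilon + |\mathcal{T}|\delta \le 2\epsilon. \tag{112}
\end{align}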



We have proved $P(p_{x^n} \notin \mathcal{D}_{states,2\epsilon}) \le 4|\mathcal{X}||\mathcal{T}| \exp(-n\tilde{\xi}(\epsilon))$, so we arrive at the conclusion of the lemma by letting $\xi(\epsilon) = \tilde{\xi}(\epsilon/2)$.
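To get a feel for the size of the exponent, the following sketch evaluates the two-argument exponent of (100) on a grid of $\delta$ values and plugs the best grid value into the tail bound. This is a rough illustration under stated assumptions: the function names, the grid search, and the parameter values are not from the paper, and a grid maximum only approximates the maximum in (102) from below (which still gives a valid, slightly weaker bound, because any admissible $\delta$ may be used).

```python
from math import exp

def zeta(eps, delta, alpha_star, x_size):
    """Exponent from (100): min{2 delta^2, eps^2 (alpha_* - delta)^2 / (2 |X|^2)}."""
    return min(2.0 * delta**2,
               eps**2 * (alpha_star - delta)**2 / (2.0 * x_size**2))

def xi_tilde_grid(eps, alpha_star, x_size, t_size, grid=1000):
    """Grid approximation (from below) of the maximum in (102)."""
    upper = min(eps / t_size, alpha_star)
    best = 0.0
    for j in range(1, grid):
        delta = upper * j / grid  # delta ranges over the open interval (0, upper)
        best = max(best, zeta(eps, delta, alpha_star, x_size))
    return best

def tail_bound(n, eps, alpha_star, x_size, t_size):
    """4 |X| |T| exp(-n * exponent), with the grid value as exponent."""
    return 4.0 * x_size * t_size * exp(-n * xi_tilde_grid(eps, alpha_star, x_size, t_size))

# Illustrative parameters (assumptions): |X| = 2, |T| = 2, alpha_* = 0.25, eps = 0.1.
xi_val = xi_tilde_grid(0.1, 0.25, 2, 2)
b_small = tail_bound(10**5, 0.1, 0.25, 2, 2)
b_large = tail_bound(10**6, 0.1, 0.25, 2, 2)
```

For these values the exponent is small, so the bound only becomes informative for fairly large $n$.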



Appendix III 
Proof of Lemma 7.2

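Let $W^*_{p,D} = \arg\min_{W \in \mathcal{W}(p,D)} I(p, W)$. Then

$$|R(p,D) - R(q,D)| = |I(p, W^*_{p,D}) - I(q, W^*_{q,D})|. \tag{113}$$

Consider $d(p, W^*_{q,D})$, the distortion of source $p$ across $q$'s distortion-$D$ achieving channel.

\begin{align}
d(p, W^*_{q,D}) &\le d(q, W^*_{q,D}) + |d(p, W^*_{q,D}) - d(q, W^*_{q,D})| \tag{114} \\
&= d(q, W^*_{q,D}) + \left| \sum_{x} \sum_{\hat{x}} (p(x) - q(x)) W^*_{q,D}(\hat{x}|x) d(x,\hat{x}) \right| \tag{115} \\
&\le D + \sum_{x} |p(x) - q(x)| \sum_{\hat{x}} W^*_{q,D}(\hat{x}|x) d(x,\hat{x}) \tag{116} \\
&\le D + \|p - q\|_1 d^*. \tag{117}
\end{align}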
(117) 
By definition, $W^*_{q,D}$ is in $\mathcal{W}(p, d(p, W^*_{q,D}))$, so $R(p, d(p, W^*_{q,D})) \le I(p, W^*_{q,D})$.

\begin{align}
R(p, d(p, W^*_{q,D})) &\le I(p, W^*_{q,D}) \tag{118} \\
&\le I(q, W^*_{q,D}) + |I(p, W^*_{q,D}) - I(q, W^*_{q,D})| \tag{119} \\
&= R(q, D) + |I(p, W^*_{q,D}) - I(q, W^*_{q,D})|. \tag{120}
\end{align}

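Expanding mutual informations yields

\begin{align}
|I(p, W^*_{q,D}) - I(q, W^*_{q,D})| &= |H(p) + H(pW^*_{q,D}) - H(p, W^*_{q,D}) - H(q) - H(qW^*_{q,D}) + H(q, W^*_{q,D})| \tag{121} \\
&\le |H(p) - H(q)| + |H(pW^*_{q,D}) - H(qW^*_{q,D})| + |H(p, W^*_{q,D}) - H(q, W^*_{q,D})|. \tag{122}
\end{align}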

Above, for a distribution $p$ on $\mathcal{X}$ and channel $W$ from $\mathcal{X}$ to $\hat{\mathcal{X}}$, $H(pW)$ denotes the entropy of a distribution on $\hat{\mathcal{X}}$ with probabilities $(pW)(\hat{x}) = \sum_{x} p(x) W(\hat{x}|x)$. $H(p, W)$ denotes the entropy of the joint source on $\mathcal{X} \times \hat{\mathcal{X}}$ with probabilities $(p, W)(x, \hat{x}) = p(x) W(\hat{x}|x)$. It is straightforward to verify that $\|pW - qW\|_1 \le \|p - q\|_1$ and $\|(p, W) - (q, W)\|_1 \le \|p - q\|_1$. So using Lemma 7.1 three times, we have

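\begin{align}
|I(p, W^*_{q,D}) - I(q, W^*_{q,D})| &\le \|p - q\|_1 \ln \frac{|\mathcal{X}|}{\|p - q\|_1} + \|p - q\|_1 \ln \frac{|\hat{\mathcal{X}}|}{\|p - q\|_1} + \|p - q\|_1 \ln \frac{|\mathcal{X}||\hat{\mathcal{X}}|}{\|p - q\|_1} \tag{123} \\
&\le 3\|p - q\|_1 \ln \frac{|\mathcal{X}||\hat{\mathcal{X}}|}{\|p - q\|_1}. \tag{124}
\end{align}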
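The two $\ell_1$ contraction claims and the resulting bound (124) can be spot-checked numerically. The sketch below uses natural logarithms, as in the text; the helper names, the two sources, and the channel are hypothetical examples, and the check assumes $0 < \|p - q\|_1 \le 1/2$.

```python
from math import log

def entropy(dist):
    """Entropy in nats of a probability vector (zero entries are skipped)."""
    return -sum(v * log(v) for v in dist if v > 0)

def output_dist(p, W):
    """(pW)(xh) = sum_x p(x) W(xh|x); W is a list of rows indexed by x."""
    return [sum(p[x] * W[x][y] for x in range(len(p))) for y in range(len(W[0]))]

def joint_dist(p, W):
    """(p, W)(x, xh) = p(x) W(xh|x), flattened into one vector."""
    return [p[x] * W[x][y] for x in range(len(p)) for y in range(len(W[0]))]

def mutual_information(p, W):
    """I(p, W) = H(p) + H(pW) - H(p, W), in nats."""
    return entropy(p) + entropy(output_dist(p, W)) - entropy(joint_dist(p, W))

def l1(a, b):
    return sum(abs(u - v) for u, v in zip(a, b))

def mi_gap_bound(p, q, x_size, xhat_size):
    """Right-hand side of (124); assumes 0 < ||p - q||_1 <= 1/2."""
    gap = l1(p, q)
    return 3.0 * gap * log(x_size * xhat_size / gap)

# Hypothetical example: two binary sources and one binary channel.
p, q = [0.5, 0.5], [0.6, 0.4]
W = [[0.9, 0.1], [0.2, 0.8]]
mi_gap = abs(mutual_information(p, W) - mutual_information(q, W))
bound = mi_gap_bound(p, q, 2, 2)
```

In this example the actual gap is much smaller than the bound; the bound is meant to be uniform over channels, not tight for any particular one.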

Now, we have seen $d(p, W^*_{q,D}) \le D + d^*\|p - q\|_1$. We will use the uniform continuity of $R(p, D)$ in $D$ to bound $|R(p, D) - R(p, D + d^*\|p - q\|_1)|$. This will give an upper bound on $R(p, D) - R(q, D)$ as seen through equation (120), namely,

\begin{align}
R(p, D) - R(q, D) &\le |I(p, W^*_{q,D}) - I(q, W^*_{q,D})| + R(p, D) - R(p, d(p, W^*_{q,D})) \tag{125} \\
&\le |I(p, W^*_{q,D}) - I(q, W^*_{q,D})| + R(p, D) - R(p, D + d^*\|p - q\|_1), \tag{126}
\end{align}

where the last step follows because $R(p, D)$ is monotonically decreasing in $D$. For a fixed $p$, the rate-distortion function in $D$ is convex-$\cup$ and decreasing and so has steepest descent at $D = 0$. Therefore, for any $0 \le D_1, D_2 \le d^*$,

$$|R(p, D_1) - R(p, D_2)| \le |R(p, 0) - R(p, |D_2 - D_1|)|. \tag{127}$$

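Hence, we can restrict our attention to continuity of $R(p, D)$ around $D = 0$. By assumption, $\mathcal{W}(p, 0) \ne \emptyset$ for all $p \in \mathcal{P}(\mathcal{X})$. Now consider an arbitrary $D > 0$, and let $W \in \mathcal{W}(p, D)$. We will show that there is some $W_0 \in \mathcal{W}(p, 0)$ that is close to $W$ in an $\ell_1$-like sense (relative to the distribution $p$). Since $W \in \mathcal{W}(p, D)$, we have by definition

\begin{align}
D &\ge \sum_{x} p(x) \sum_{\hat{x}} W(\hat{x}|x) d(x, \hat{x}) \tag{128} \\
&= \sum_{x} p(x) \sum_{\hat{x}:\, d(x,\hat{x}) > 0} W(\hat{x}|x) d(x, \hat{x}) \tag{129} \\
&\ge \tilde{d} \sum_{x} p(x) \sum_{\hat{x}:\, d(x,\hat{x}) > 0} W(\hat{x}|x), \tag{130}
\end{align}

where the last step uses that every nonzero value of $d(x, \hat{x})$ is at least $\tilde{d}$.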

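Now, we will construct a channel in $\mathcal{W}(p, 0)$, denoted $W_0$. First, for each $x, \hat{x}$ such that $d(x, \hat{x}) = 0$, let $V(\hat{x}|x) = W(\hat{x}|x)$. For all other $(x, \hat{x})$, set $V(\hat{x}|x) = 0$. Note that $V$ is not a channel matrix if $W \notin \mathcal{W}(p, 0)$ since it is missing some probability mass. To create $W_0$, for each $x$, we redistribute the missing mass from $V(\cdot|x)$ to the pairs $(x, \hat{x})$ with $d(x, \hat{x}) = 0$. Namely, for $(x, \hat{x})$ with $d(x, \hat{x}) = 0$, we define

$$W_0(\hat{x}|x) = V(\hat{x}|x) + \frac{\sum_{\hat{x}':\, d(x,\hat{x}') > 0} W(\hat{x}'|x)}{|\{\hat{x}' : d(x, \hat{x}') = 0\}|}. \tag{131}$$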



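For all $(x, \hat{x})$ with $d(x, \hat{x}) > 0$, define $W_0(\hat{x}|x) = 0$. So, $W_0$ is a valid channel in $\mathcal{W}(p, 0)$. Now for a fixed $x \in \mathcal{X}$,

\begin{align}
\sum_{\hat{x}} |W(\hat{x}|x) - W_0(\hat{x}|x)| &= \sum_{\hat{x}:\, d(x,\hat{x}) > 0} W(\hat{x}|x) + \sum_{\hat{x}:\, d(x,\hat{x}) = 0} |W(\hat{x}|x) - W_0(\hat{x}|x)| \tag{132} \\
&= \sum_{\hat{x}:\, d(x,\hat{x}) > 0} W(\hat{x}|x) + \sum_{\hat{x}:\, d(x,\hat{x}) = 0} \left| W(\hat{x}|x) - W(\hat{x}|x) - \frac{\sum_{\hat{x}':\, d(x,\hat{x}') > 0} W(\hat{x}'|x)}{|\{\hat{x}' : d(x, \hat{x}') = 0\}|} \right| \tag{133} \\
&= 2 \sum_{\hat{x}:\, d(x,\hat{x}) > 0} W(\hat{x}|x). \tag{134}
\end{align}

Therefore, using (130),

$$\sum_{x} p(x) \sum_{\hat{x}} |W(\hat{x}|x) - W_0(\hat{x}|x)| \le \frac{2D}{\tilde{d}}. \tag{135}$$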
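The construction of $W_0$ and the bound (135) can be illustrated with a few lines of code. This is a minimal sketch, assuming channels and distortion measures stored as lists of rows indexed by $x$; the function names and the Hamming-distortion example are assumptions made for illustration.

```python
def zero_distortion_channel(W, d):
    """Build W0 from W: keep mass on d = 0 outputs and spread the mass that W
    places on d > 0 outputs uniformly over the d = 0 outputs, row by row."""
    W0 = []
    for x, row in enumerate(W):
        zero_set = [xh for xh, dist in enumerate(d[x]) if dist == 0]
        if not zero_set:
            raise ValueError("no zero-distortion reproduction for input %d" % x)
        missing = sum(w for xh, w in enumerate(row) if d[x][xh] > 0)
        share = missing / len(zero_set)
        W0.append([row[xh] + share if d[x][xh] == 0 else 0.0
                   for xh in range(len(row))])
    return W0

def expected_distortion(p, W, d):
    return sum(p[x] * W[x][xh] * d[x][xh]
               for x in range(len(p)) for xh in range(len(W[x])))

def modified_l1(p, W, W0):
    """Left-hand side of (135): sum_x p(x) sum_xh |W(xh|x) - W0(xh|x)|."""
    return sum(p[x] * sum(abs(a - b) for a, b in zip(W[x], W0[x]))
               for x in range(len(p)))

# Hypothetical ternary example with Hamming distortion.
p = [0.5, 0.3, 0.2]
W = [[0.9, 0.05, 0.05], [0.1, 0.8, 0.1], [0.0, 0.2, 0.8]]
d = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
W0 = zero_distortion_channel(W, d)
D = expected_distortion(p, W, d)
d_tilde = min(v for row in d for v in row if v > 0)  # smallest nonzero distortion
```

With Hamming distortion every nonzero distortion equals $\tilde{d} = 1$, so (130) holds with equality and the modified $\ell_1$ distance equals $2D/\tilde{d}$ up to rounding.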



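So, for $W = W^*_{p,D}$, there is a $W_0 \in \mathcal{W}(p, 0)$ with the above 'modified $\ell_1$ distance' with respect to $p$ between $W$ and $W_0$ being at most $2D/\tilde{d}$. Going back to the bound on $|R(p, 0) - R(p, D)|$,

\begin{align}
|R(p, 0) - R(p, D)| &= \left| \min_{W \in \mathcal{W}(p,0)} I(p, W) - I(p, W^*_{p,D}) \right| \tag{136} \\
&\le I(p, W_0) - I(p, W^*_{p,D}) \tag{137} \\
&\le |H(pW_0) - H(pW^*_{p,D})| + |H(p, W_0) - H(p, W^*_{p,D})|. \tag{138}
\end{align}

Now, note that the $\ell_1$ distance between $pW_0$ and $pW^*_{p,D}$ is

\begin{align}
\|pW_0 - pW^*_{p,D}\|_1 &= \sum_{\hat{x}} \left| \sum_{x} p(x) W_0(\hat{x}|x) - p(x) W^*_{p,D}(\hat{x}|x) \right| \tag{139} \\
&\le \sum_{x} p(x) \sum_{\hat{x}} |W_0(\hat{x}|x) - W^*_{p,D}(\hat{x}|x)| \tag{140} \\
&\le \frac{2D}{\tilde{d}}. \tag{141}
\end{align}

Similarly, $\|(p, W_0) - (p, W^*_{p,D})\|_1 \le 2D/\tilde{d}$.

Now, assuming $D \le \tilde{d}/4$, we can again invoke Lemma 7.1 to get

\begin{align}
|R(p, 0) - R(p, D)| &\le \frac{2D}{\tilde{d}} \ln \frac{\tilde{d}|\hat{\mathcal{X}}|}{2D} + \frac{2D}{\tilde{d}} \ln \frac{\tilde{d}|\mathcal{X}||\hat{\mathcal{X}}|}{2D} \tag{142} \\
&\le \frac{4D}{\tilde{d}} \ln \frac{\tilde{d}|\mathcal{X}||\hat{\mathcal{X}}|}{2D}. \tag{143}
\end{align}

Going back to (126), we see that if $\|p - q\|_1 \le \tilde{d}/(4d^*)$,

\begin{align}
R(p, D) - R(p, D + d^*\|p - q\|_1) &\le \frac{4d^*\|p - q\|_1}{\tilde{d}} \ln \frac{\tilde{d}|\mathcal{X}||\hat{\mathcal{X}}|}{2d^*\|p - q\|_1} \tag{144} \\
&\le \frac{4d^*\|p - q\|_1}{\tilde{d}} \ln \frac{|\mathcal{X}||\hat{\mathcal{X}}|}{\|p - q\|_1}. \tag{145}
\end{align}

The last step follows because $\tilde{d}/d^* \le 1$. Substituting into equation (126) gives

\begin{align}
R(p, D) - R(q, D) &\le 3\|p - q\|_1 \ln \frac{|\mathcal{X}||\hat{\mathcal{X}}|}{\|p - q\|_1} + \frac{4d^*}{\tilde{d}} \|p - q\|_1 \ln \frac{|\mathcal{X}||\hat{\mathcal{X}}|}{\|p - q\|_1} \tag{146} \\
&\le \frac{7d^*}{\tilde{d}} \|p - q\|_1 \ln \frac{|\mathcal{X}||\hat{\mathcal{X}}|}{\|p - q\|_1}. \tag{147}
\end{align}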
Finally, this bound holds uniformly on $p$ and $q$ as long as the condition on $\|p - q\|_1$ is satisfied. Therefore, we can interchange $p$ and $q$ to get the other side of the inequality.

$$R(q, D) - R(p, D) \le \frac{7d^*}{\tilde{d}} \|p - q\|_1 \ln \frac{|\mathcal{X}||\hat{\mathcal{X}}|}{\|p - q\|_1}. \tag{148}$$

This concludes the proof.



Appendix IV 
Proof of Lemma 7.3

We now assume $d : \mathcal{X} \times \hat{\mathcal{X}} \to [0, d^*]$ to be arbitrary. However, we let

$$d_0(x, \hat{x}) = d(x, \hat{x}) - \min_{\hat{x}'} d(x, \hat{x}') \tag{149}$$

so that Lemma 7.2 applies to $d_0$. Let $R_0(p, D)$ be the IID rate-distortion function for $p \in \mathcal{P}(\mathcal{X})$ at distortion $D$ with respect to distortion measure $d_0(x, \hat{x})$. By definition, $R(p, D)$ is the IID rate-distortion function for $p$ with respect to distortion measure $d(x, \hat{x})$. From Problem 13.4 of [20], for any $D \ge D_{\min}(p)$,

$$R(p, D) = R_0(p, D - D_{\min}(p)). \tag{150}$$

Hence, for $p, q \in \mathcal{P}(\mathcal{X})$, $D \ge \max(D_{\min}(p), D_{\min}(q))$,

\begin{align}
|R(p, D) - R(q, D)| &= |R_0(p, D - D_{\min}(p)) - R_0(q, D - D_{\min}(q))| \tag{151} \\
&\le |R_0(p, D - D_{\min}(p)) - R_0(p, D - D_{\min}(q))| + |R_0(p, D - D_{\min}(q)) - R_0(q, D - D_{\min}(q))|. \tag{152}
\end{align}

Now, we note that $|D_{\min}(p) - D_{\min}(q)| \le d^*\|p - q\|_1$. The first term of equation (152) can be bounded using equation (143) and the second term of (152) can be bounded using Lemma 7.2. The first term can be bounded if $\|p - q\|_1 \le \tilde{d}_0/4d^*$ and the second can be bounded if $\|p - q\|_1 \le \tilde{d}_0/4d_0^*$. Since $d_0^* \le d^*$, we only require $\|p - q\|_1 \le \tilde{d}_0/4d^*$.

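\begin{align}
|R(p, D) - R(q, D)| &\le \frac{4d^*}{\tilde{d}_0} \|p - q\|_1 \ln \frac{\tilde{d}_0 |\mathcal{X}||\hat{\mathcal{X}}|}{2d^*\|p - q\|_1} + \frac{7d_0^*}{\tilde{d}_0} \|p - q\|_1 \ln \frac{|\mathcal{X}||\hat{\mathcal{X}}|}{\|p - q\|_1} \tag{153} \\
&\le \frac{4d^*}{\tilde{d}_0} \|p - q\|_1 \ln \frac{|\mathcal{X}||\hat{\mathcal{X}}|}{\|p - q\|_1} + \frac{7d^*}{\tilde{d}_0} \|p - q\|_1 \ln \frac{|\mathcal{X}||\hat{\mathcal{X}}|}{\|p - q\|_1}. \tag{154}
\end{align}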
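The shift in (149) and the Lipschitz property of $D_{\min}$ used above are easy to verify on a small example. The sketch below assumes $D_{\min}(p) = \sum_x p(x) \min_{\hat{x}} d(x, \hat{x})$, which is the smallest expected distortion achievable for source $p$; the helper names and the distortion matrix are hypothetical.

```python
def shifted_distortion(d):
    """d0(x, xh) = d(x, xh) - min over xh' of d(x, xh'), as in (149)."""
    return [[v - min(row) for v in row] for row in d]

def d_min(p, d):
    """Assumed form of D_min(p): sum_x p(x) * min_xh d(x, xh)."""
    return sum(p[x] * min(d[x]) for x in range(len(p)))

def l1(a, b):
    return sum(abs(u - v) for u, v in zip(a, b))

# Hypothetical distortion measure whose row minima are not zero.
d = [[0.5, 2.0], [1.0, 0.25]]
p, q = [0.5, 0.5], [0.6, 0.4]
d0 = shifted_distortion(d)
d_star = max(v for row in d for v in row)      # largest value of d
d0_star = max(v for row in d0 for v in row)    # largest value of d0
gap = abs(d_min(p, d) - d_min(q, d))
```

After the shift every row of `d0` has minimum zero, so each input has a zero-distortion reproduction, which is the setting of the previous lemma.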

References 

[1] H. Palaiyanur, C. Chang, and A. Sahai, "The source coding game with a cheating switcher," in Proc. Int. Symp. Inform. Theory, Nice, 
France, June 2007. 

[2] R. Bajcsy, "Active perception," Proceedings of the IEEE, vol. 76, no. 8, pp. 966-1005, Aug. 1988. 

[3] A. Chebira, P. Dragotti, L. Sbaiz, and M. Vetterli, "Sampling and interpolation of the plenoptic function," in Proc. of IEEE International Conference on Image Processing, Barcelona, Spain, Sept. 2003.
[4] P. Longman, "The best care anywhere," Washington Monthly, Jan. 2005. 

[5] T. Berger, "The source coding game," IEEE Transactions on Information Theory, vol. 17, pp. 71-76, Jan. 1971. 
[6] C. Shannon, "Channels with side information at the transmitter," IBM J. Res. Devel, vol. 2, pp. 289-293, Oct. 1958. 
[7] S. Gelfand and M. Pinsker, "Coding for channel with random parameters," Probl. Pered. Inform. (Probl. Inf. Transm.), vol. 9, pp. 19-31, 1980.

[8] M. H. Costa, "Writing on dirty paper," IEEE Transactions on Information Theory, vol. 29, pp. 439-441, May 1983. 

[9] F. Willems, "On Gaussian channels with side information at the transmitter," in Proc. Int. Symp. Inform. Theory, Benelux, Enschede, The Netherlands, May 1988, pp. 129-135.

[10] F. Willems, "Signalling for the Gaussian channel with side information at the transmitter," in Proc. Int. Symp. Inform. Theory, Sorrento, Italy, June 2000.

[11] U. Erez, S. Shamai (Shitz), and R. Zamir, "Capacity and lattice strategies for canceling known interference," IEEE Transactions on Information Theory, vol. 51, Nov. 2005.

[12] M. Agarwal, A. Sahai, and S. Mitter, "Coding into a source: a direct inverse rate-distortion theorem," in Forty-fourth Allerton Conference on Communication, Control, and Computing, Monticello, IL, Sept. 2006. [Online]. Available: http://arxiv.org/abs/cs.IT/0610142

[13] D. Neuhoff and R. K. Gilbert, "Causal source codes," IEEE Transactions on Information Theory, vol. 28, pp. 701-713, Sept. 1982. 

[14] T. Weissman and N. Merhav, "On causal source codes with side information," IEEE Transactions on Information Theory, vol. 51, pp. 4003-4013, Nov. 2005.






[15] S. Tatikonda, A. Sahai, and S. Mitter, "Stochastic linear control over a communication channel," IEEE Transactions on Automatic Control, vol. 49, no. 9, pp. 1549-1561, Sept. 2004.
[16] C. Shannon, "Coding theorems for a discrete source with a fidelity criterion," in IRE Natl. Conv. Rec, 1959, pp. 142-163. 
[17] J. Wolfowitz, "Approximation with a fidelity criterion," in 5th Berkeley Symp. on Math. Stat. and Prob., vol. 1. Berkeley, California: University of California Press, 1967, pp. 565-573.
[18] D. Sakrison, "The rate-distortion function for a class of sources," Information and Control, vol. 15, pp. 165-195, Mar. 1969. 
[19] I. Csiszar and J. Korner, Information Theory: Coding Theorems for Discrete Memoryless Systems, 2nd ed. New York, NY: Academic Press, 1997.

[20] T. Cover and J. Thomas, Elements of Information Theory. New York, NY: John Wiley and Sons, 1991. 
[21] R. Gallager, Information Theory and Reliable Communication. New York, NY: John Wiley and Sons, 1971.

[22] R. Ahlswede, "Extremal properties of rate-distortion functions," IEEE Transactions on Information Theory, vol. 36, pp. 166-171, Jan. 
1990. 

[23] M. Harrison and I. Kontoyiannis, "Estimation of the rate-distortion function," 2007. [Online]. Available: http://arxiv.org/abs/cs/0702018v1

[24] T. Weissman, E. Ordentlich, G. Seroussi, S. Verdu, and M. L. Weinberger, "Inequalities for the $L_1$ deviation of the empirical distribution," Hewlett-Packard Labs, Tech. Rep., 2003. [Online]. Available: http://www.hpl.hp.com/techreports/2003/HPL-2003-97R1.html

[25] R. Dobrushin, "Unified methods for the transmission of information: The general case," Sov. Math., vol. 4, pp. 284-292, 1963. 

[26] W. Hoeffding, "Probability inequalities for sums of bounded random variables," Journal of the American Statistical Association, vol. 58, no. 301, pp. 13-30, Mar. 1963.

[27] K. Azuma, "Weighted sums of certain dependent random variables," Tohoku Math. Journal, vol. 19, pp. 357-367, 1967.



